Can you advise on what the issues may be?

Yehuda Sadeh <[email protected]> wrote:
>On Wed, Jan 22, 2014 at 8:55 AM, Graeme Lambert <[email protected]> wrote:
>> Hi Yehuda,
>>
>> With regards to the health status of the cluster, it isn't healthy,
>> but I haven't found any way of fixing the placement group errors.
>> Looking at the ceph health detail output, it's also showing blocked
>> requests:
>>
>> HEALTH_WARN 1 pgs down; 3 pgs incomplete; 3 pgs stuck inactive; 3 pgs stuck unclean; 7 requests are blocked > 32 sec; 3 osds have slow requests; pool cloudstack has too few pgs; pool .rgw.buckets has too few pgs
>> pg 14.0 is stuck inactive since forever, current state incomplete, last acting [5,0]
>> pg 14.2 is stuck inactive since forever, current state incomplete, last acting [0,5]
>> pg 14.6 is stuck inactive since forever, current state down+incomplete, last acting [4,2]
>> pg 14.0 is stuck unclean since forever, current state incomplete, last acting [5,0]
>> pg 14.2 is stuck unclean since forever, current state incomplete, last acting [0,5]
>> pg 14.6 is stuck unclean since forever, current state down+incomplete, last acting [4,2]
>> pg 14.0 is incomplete, acting [5,0]
>> pg 14.2 is incomplete, acting [0,5]
>> pg 14.6 is down+incomplete, acting [4,2]
>
>
>You should figure these out first before trying to get the gateway
>working. They may very well be your culprit.
>
>Yehuda
>
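Digging into the incomplete pgs and the blocked requests is the usual
first step here. A minimal sketch of the standard diagnostics, using
the pg ids from the output above and assuming the default admin socket
path:

    # Ask one of the incomplete pgs why peering is stuck; the
    # "recovery_state" section of the output explains what it is
    # waiting for.
    ceph pg 14.6 query

    # List every stuck pg in one go.
    ceph pg dump_stuck inactive

    # Confirm that the acting OSDs (0, 2, 4 and 5) are up and in.
    ceph osd tree

    # Inspect the slow requests on an affected OSD (run on that
    # OSD's host; assumes the default admin socket location).
    ceph --admin-daemon /var/run/ceph/ceph-osd.0.asok dump_ops_in_flight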
>> 3 ops are blocked > 8388.61 sec
>> 3 ops are blocked > 4194.3 sec
>> 1 ops are blocked > 2097.15 sec
>> 1 ops are blocked > 8388.61 sec on osd.0
>> 1 ops are blocked > 4194.3 sec on osd.0
>> 2 ops are blocked > 8388.61 sec on osd.4
>> 2 ops are blocked > 4194.3 sec on osd.5
>> 1 ops are blocked > 2097.15 sec on osd.5
>> 3 osds have slow requests
>> pool cloudstack objects per pg (37316) is more than 27.1587 times cluster average (1374)
>> pool .rgw.buckets objects per pg (76219) is more than 55.4723 times cluster average (1374)
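The two "too few pgs" warnings can be cleared by raising the placement
group count on the affected pools. A sketch for .rgw.buckets; 128 is
only an example target, and splitting pgs adds load to an already
struggling cluster, so this is best left until the incomplete pgs are
resolved:

    # Raise pg_num first, then pgp_num so data actually rebalances
    # across the new placement groups.
    ceph osd pool set .rgw.buckets pg_num 128
    ceph osd pool set .rgw.buckets pgp_num 128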
>>
>>
>> Ignore the cloudstack pool; I was using CloudStack but not anymore,
>> so it's an inactive pool.
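If the cloudstack pool really holds nothing of value anymore, deleting
it clears its warning as well. A sketch only; deletion is irreversible,
so verify the pool's contents first:

    # Confirm the pool is empty (or at least expendable).
    rados df

    # Then delete it; the doubled pool name and the flag are
    # deliberate safety interlocks.
    ceph osd pool delete cloudstack cloudstack --yes-i-really-really-mean-it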
>>
>> Best regards
>>
>> Graeme
>>
>>
>>
>> On 22/01/14 16:38, Graeme Lambert wrote:
>>
>> Hi,
>>
>> Following discussions with people on IRC, I set debug_ms, and this
>> is what is looped over and over when one of the requests is stuck:
>> http://pastebin.com/KVcpAeYT
>>
>> Regarding the modules, the Apache version is 2.2.22-2precise.ceph
>> and the fastcgi module version is 2.4.7~0910052141-2~bpo70+1.ceph.
>>
>> Best regards
>>
>> Graeme
>>
>>
>> On 22/01/14 16:28, Yehuda Sadeh wrote:
>>
>> On Wed, Jan 22, 2014 at 8:05 AM, Graeme Lambert <[email protected]> wrote:
>>
>> Hi,
>>
>> I'm using the aws-sdk-for-php classes with the Ceph RADOS gateway,
>> but I'm getting an intermittent issue when uploading files.
>>
>> I'm attempting to upload an array of objects to Ceph one by one
>> using the create_object() function. It appears to stop at random
>> while working through them: it could stop at the first object,
>> somewhere in between, or at the last one; there is no pattern to it
>> that I can see.
>>
>> I'm not getting any PHP errors that indicate an issue, and equally
>> no exceptions are being caught.
>>
>> In the radosgw log file, at the time it appears stuck, I get:
>>
>> 2014-01-22 15:39:21.656763 7fac44fe1700  1 ====== starting new request req=0x2417c30 =====
>>
>> And then sometimes I see:
>>
>> 2014-01-22 15:40:42.490485 7fac99ff9700  1 heartbeat_map is_healthy 'RGWProcess::m_tp thread 0x7fac51ffb700' had timed out after 600
>>
>> repeated over and over again.
>>
>> When those messages are appearing, Apache's error log shows:
>>
>> [Wed Jan 22 15:43:11 2014] [error] [client 172.16.2.149] FastCGI: comm with server "/var/www/s3gw.fcgi" aborted: idle timeout (30 sec), referer: https://[server]/[path]
>>
>> equally over and over again.
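The "idle timeout (30 sec)" matches mod_fastcgi's default -idle-timeout,
so Apache is aborting the connection while radosgw is still waiting on
the cluster. Raising it masks the symptom rather than fixing the slow
OSDs, but it stops the aborts while you debug. A sketch, assuming the
usual external-server setup; the socket path here is an assumption and
should match your existing FastCgiExternalServer line:

    # In the radosgw virtual host configuration: raise the idle
    # timeout from the 30 s default to something generous.
    FastCgiExternalServer /var/www/s3gw.fcgi -socket /var/run/ceph/radosgw.sock -idle-timeout 600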
>>
>> I have restarted Apache, radosgw, all the Ceph OSDs, and the
>> ceph-mon processes, and still no joy with this.
>>
>> Can anyone advise on where I'm going wrong with this?
>>
>> Which fastcgi module are you using? Can you provide a log with
>> 'debug ms = 1' for a failing request? Usually that kind of message
>> means that it's waiting for the OSD to respond, which might point at
>> an unhealthy cluster.
>>
>> Yehuda
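For the requested log, 'debug ms = 1' can be enabled on a running
radosgw through its admin socket. The socket name below is an
assumption; match it to the [client.radosgw.*] section in your
ceph.conf:

    # Raise messenger logging on the live gateway, no restart needed.
    ceph --admin-daemon /var/run/ceph/ceph-client.radosgw.gateway.asok config set debug_ms 1

    # Or set it persistently in ceph.conf and restart radosgw:
    #   [client.radosgw.gateway]
    #   debug ms = 1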
>>
>>
>>
>>
_______________________________________________
ceph-users mailing list
[email protected]
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
