Hi,
(Sorry for top posting, mobile now).

That's exactly what I observe -- one sleep per PG. The problem is that the 
sleep can't simply be moved, since AFAICT the whole PG is locked for the 
duration of the trimmer. So the options I proposed are to limit the number of 
snaps trimmed per call to e.g. 16, or to fix the loss of purged_snaps after 
backfilling. Actually, probably both of those are needed. But a real dev would 
know better.
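
FWIW, a quick way to check for the purged_snaps loss is to dump a PG's info 
before and after it gets backfilled (the PG id below is just a placeholder):

  ceph pg 4.1ff query | grep purged_snaps

If the interval set comes back (nearly) empty after the backfill, that PG will 
re-trim everything on the next trimming pass.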

Cheers, Dan


From: Florian Haas <flor...@hastexo.com>
Sent: Sep 17, 2014 5:33 PM
To: Dan Van Der Ster
Cc: Craig Lewis <cle...@centraldesktop.com>; ceph-users@lists.ceph.com
Subject: Re: [ceph-users] RGW hung, 2 OSDs using 100% CPU

On Wed, Sep 17, 2014 at 5:24 PM, Dan Van Der Ster
<daniel.vanders...@cern.ch> wrote:
> Hi Florian,
>
>> On 17 Sep 2014, at 17:09, Florian Haas <flor...@hastexo.com> wrote:
>>
>> Hi Craig,
>>
>> just dug this up in the list archives.
>>
>> On Fri, Mar 28, 2014 at 2:04 AM, Craig Lewis <cle...@centraldesktop.com> 
>> wrote:
>>> In the interest of removing variables, I removed all snapshots on all pools,
>>> then restarted all ceph daemons at the same time.  This brought up osd.8 as
>>> well.
>>
>> So just to summarize this: your 100% CPU problem at the time went away
>> after you removed all snapshots, and the actual cause of the issue was
>> never found?
>>
>> I am seeing a similar issue now, and have filed
>> http://tracker.ceph.com/issues/9503 to make sure it doesn't get lost
>> again. Can you take a look at that issue and let me know if anything
>> in the description sounds familiar?
>
>
> Could your ticket be related to the snap trimming issue I’ve finally narrowed 
> down over the past couple of days?
>
>   http://tracker.ceph.com/issues/9487
>
> Bump debug_osd up to 20, then check the log during one of your incidents. If 
> it is busy logging snap_trimmer messages, then it’s the same issue. (The 
> issue is that rbd pools have many purged_snaps, but sometimes after 
> backfilling a PG the purged_snaps list is lost, and thus the snap trimmer 
> becomes very busy whilst re-trimming thousands of snaps. During that time (a 
> few minutes on my cluster) the OSD is blocked.)

That sounds promising, thank you! debug_osd=10 should actually be
sufficient as those snap_trim messages get logged at that level. :)
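
For anyone else who needs to check for this, something along these lines 
should be enough to confirm it (the OSD id and log path below are just 
examples, adjust to your setup):

  ceph tell osd.8 injectargs '--debug_osd 10'
  grep snap_trim /var/log/ceph/ceph-osd.8.log | tail -n 20

If the log fills up with snap_trimmer entries while the OSD is spinning at 
100% CPU, it's the same problem.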

Do I understand your issue report correctly in that you found setting
osd_snap_trim_sleep to be ineffective because it is applied when
iterating from PG to PG, rather than from snap to snap? If so, I'm
guessing that can hardly be intentional...
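
(For reference, this is the knob in question; the OSD id and the 0.05s value
below are just examples. With the per-PG behaviour you describe, injecting it, e.g.

  ceph tell osd.8 injectargs '--osd_snap_trim_sleep 0.05'

would only add a pause between PGs, not between the thousands of snaps being
re-trimmed within a single PG.)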

Cheers,
Florian
