Re: [ceph-users] Some long running ops may lock osd

Erdem Agaoglu Tue, 03 Mar 2015 01:21:05 -0800

Looking further, I think what I was trying to describe is a simplified version
of the sharded thread pools released in Giant. Is it possible for that to be
backported to Firefly?
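
The idea can be sketched roughly as follows. This is a toy Python model and
not Ceph's actual code: the class name ShardedOpQueue, the shard_for helper
and the use of crc32 as the hash are all assumptions made for illustration.
Ops are routed to a shard by hashing the pg id, and each shard would be
served by its own worker thread(s), so a long-running op on one pg can only
stall the shard it hashes to.

```python
import zlib
from collections import deque


class ShardedOpQueue:
    """Toy model: one FIFO queue per shard, ops routed by pg id."""

    def __init__(self, num_shards):
        self.shards = [deque() for _ in range(num_shards)]

    def shard_for(self, pgid):
        # Hypothetical hash choice; any stable hash of the pg id would do.
        return zlib.crc32(pgid.encode()) % len(self.shards)

    def enqueue(self, pgid, op):
        # All ops for the same pg land in the same shard, in arrival order.
        self.shards[self.shard_for(pgid)].append((pgid, op))

    def dequeue(self, shard):
        # A worker bound to `shard` only ever looks at its own queue.
        queue = self.shards[shard]
        return queue.popleft() if queue else None
```

A worker for shard N would loop on dequeue(N); pgs that hash to other shards
are never queued behind a slow op on this one.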


On Tue, Mar 3, 2015 at 9:33 AM, Erdem Agaoglu <[email protected]>
wrote:

> Thank you folks for bringing that up. I had some questions about sharding.
> We'd like blind buckets too, at least it's on the roadmap. For the current
> sharded implementation, what are the final details? Is the number of shards
> defined per bucket or globally? Is there a way to split current indexes
> into shards?
>
> On the other hand, what I'd like to point out here is not necessarily
> specific to large bucket indexes. The problem is the mechanism around thread
> pools. Any request may require a lock on a pg, and this should not block
> requests for other pgs. I'm no expert, but the threads might be able to
> requeue requests for a locked pg and process requests for other pgs in the
> meantime. Or maybe a thread-per-pg design would be possible. It is
> somewhat OK not to be able to do anything for a locked resource; then you
> can go and improve your processing or your locks. But it's a whole
> different problem when a locked pg blocks requests for a few hundred other
> pgs in other pools for no good reason.
>
> On Tue, Mar 3, 2015 at 5:43 AM, Ben Hines <[email protected]> wrote:
>
>> Blind-bucket would be perfect for us, as we don't need to list the
>> objects.
>>
>> We only need to list the bucket when doing a bucket deletion. If we
>> could clean out/delete all objects in a bucket (without
>> iterating/listing them) that would be ideal.
>>
>> On Mon, Mar 2, 2015 at 7:34 PM, GuangYang <[email protected]> wrote:
>> > We have had good experience so far keeping each bucket below 0.5 million
>> > objects by client-side sharding. But I think it would be good if you could
>> > test at your scale, with your hardware configuration, and against your
>> > expectations for tail latency.
>> >
>> > Generally bucket sharding should help, both for write throughput and for
>> > *stalls during recovery/scrubbing*, but it comes with a price: with X
>> > shards per bucket, listing/trimming costs X times as much from the OSDs'
>> > point of view. There was discussion of implementing: 1) blind buckets (for
>> > use cases where bucket listing is not needed); 2) unordered listing, which
>> > could mitigate the problem I mentioned above. They are on the roadmap...
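
The listing cost mentioned above can be illustrated with a minimal Python
sketch. It is only a model under stated assumptions: the helper names
(shard_of, build_shards, ordered_listing) and the crc32-based placement are
made up for illustration and are not RGW's actual scheme. The point is that
once object names are hashed across X index shards, an ordered listing has
to read from all X shards and merge the results.

```python
import heapq
import zlib


def shard_of(name, num_shards):
    # Hypothetical placement: hash the object name to pick an index shard.
    return zlib.crc32(name.encode()) % num_shards


def build_shards(names, num_shards):
    # Each shard keeps its own keys sorted, like a per-shard index.
    shards = [[] for _ in range(num_shards)]
    for name in names:
        shards[shard_of(name, num_shards)].append(name)
    return [sorted(shard) for shard in shards]


def ordered_listing(shards):
    # An ordered bucket listing must consult every shard and k-way merge.
    return list(heapq.merge(*shards))
```

An unordered listing could instead walk the shards one after another, and a
blind bucket would skip the index entirely.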
>> >
>> > Thanks,
>> > Guang
>> >
>> >
>> > ----------------------------------------
>> >> From: [email protected]
>> >> Date: Mon, 2 Mar 2015 18:13:25 -0800
>> >> To: [email protected]
>> >> CC: [email protected]
>> >> Subject: Re: [ceph-users] Some long running ops may lock osd
>> >>
>> >> We're seeing a lot of this as well. (as i mentioned to sage at
>> >> SCALE..) Is there a rule of thumb at all for how big is safe to let a
>> >> RGW bucket get?
>> >>
>> >> Also, is this theoretically resolved by the new bucket-sharding
>> >> feature in the latest dev release?
>> >>
>> >> -Ben
>> >>
>> >> On Mon, Mar 2, 2015 at 11:08 AM, Erdem Agaoglu <
>> [email protected]> wrote:
>> >>> Hi Gregory,
>> >>>
>> >>> We are not using listomapkeys that way, or in any way to be precise. I
>> >>> used it here just to reproduce the behavior/issue.
>> >>>
>> >>> What I am really interested in is whether deep scrubbing actually
>> >>> mitigates the problem and/or whether there is something that can be
>> >>> further improved.
>> >>>
>> >>> Or I guess we should go upgrade now and hope for the best :)
>> >>>
>> >>> On Mon, Mar 2, 2015 at 8:10 PM, Gregory Farnum <[email protected]>
>> wrote:
>> >>>>
>> >>>> On Mon, Mar 2, 2015 at 7:56 AM, Erdem Agaoglu <
>> [email protected]>
>> >>>> wrote:
>> >>>>> Hi all, especially devs,
>> >>>>>
>> >>>>> We have recently pinpointed one of the causes of slow requests in our
>> >>>>> cluster. It seems deep-scrubs on pgs that contain the index file for a
>> >>>>> large radosgw bucket lock the osds. Increasing op threads and/or disk
>> >>>>> threads helps a little, but we would need to increase them beyond
>> >>>>> reason to get rid of the problem completely. A somewhat similar (and
>> >>>>> more severe) version of the issue occurs when we call listomapkeys on
>> >>>>> the index file, and since the logs for deep-scrubbing were much harder
>> >>>>> to read, this inspection is based on listomapkeys.
>> >>>>>
>> >>>>> In this example osd.121 is the primary of pg 10.c91, which contains the
>> >>>>> file .dir.5926.3 in the .rgw.buckets pool. The OSD has 2 op threads.
>> >>>>> The bucket contains ~500k objects. A standard listomapkeys call takes
>> >>>>> about 3 seconds.
>> >>>>>
>> >>>>> time rados -p .rgw.buckets listomapkeys .dir.5926.3 > /dev/null
>> >>>>> real 0m2.983s
>> >>>>> user 0m0.760s
>> >>>>> sys 0m0.148s
>> >>>>>
>> >>>>> In order to lock the osd we issue 2 of them concurrently with something
>> >>>>> like:
>> >>>>>
>> >>>>> rados -p .rgw.buckets listomapkeys .dir.5926.3 > /dev/null &
>> >>>>> sleep 1
>> >>>>> rados -p .rgw.buckets listomapkeys .dir.5926.3 > /dev/null &
>> >>>>>
>> >>>>> 'debug_osd=30' logs show the flow like this:
>> >>>>>
>> >>>>> At t0 some thread enqueue_op's my omap-get-keys request.
>> >>>>> Op-Thread A locks pg 10.c91, dequeue_op's the request and starts reading
>> >>>>> ~500k keys.
>> >>>>> Op-Thread B responds to several other requests during that 1 second
>> >>>>> sleep. They're generally extremely fast subops on other pgs.
>> >>>>> At t1 (about a second later) my second omap-get-keys request gets
>> >>>>> enqueue_op'ed. But it does not start, probably because of the lock held
>> >>>>> by Thread A.
>> >>>>> After that point other threads enqueue_op other requests on other pgs
>> >>>>> too, but none of them starts processing, at which point I consider the
>> >>>>> osd locked.
>> >>>>> At t2 (about another second later) my first omap-get-keys request
>> >>>>> finishes.
>> >>>>> Op-Thread B locks pg 10.c91, dequeue_op's my second request and starts
>> >>>>> reading ~500k keys again.
>> >>>>> Op-Thread A continues to process the requests enqueued in t1-t2.
>> >>>>>
>> >>>>> It seems Op-Thread B is waiting on the lock held by Op-Thread A, even
>> >>>>> though it could process requests for other pgs just fine.
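
The timeline above can be reproduced with a small deterministic model in
Python. It is only a sketch under assumptions that are not confirmed by the
logs: all op threads share one FIFO queue, a thread takes the op at the head
of the queue and then waits for that op's pg lock, each listomapkeys holds
the lock for 2000 ms, and the small ops take 10 ms. The pg names other than
10.c91 and the exact arrival times are made up.

```python
def simulate(ops, num_threads):
    """Return the start time (ms) of each op.

    ops: list of (arrival_ms, pg, duration_ms), sorted by arrival.
    Assumed model: one shared FIFO queue; the earliest-free thread takes
    the next op and then blocks until that op's pg lock is released.
    """
    thread_free_at = [0] * num_threads
    pg_free_at = {}
    starts = []
    for arrival, pg, duration in ops:
        # The thread that becomes free first picks up the next queued op.
        t = min(range(num_threads), key=lambda i: thread_free_at[i])
        dequeued = max(arrival, thread_free_at[t])
        # The thread is stuck with this op until the pg lock is available.
        start = max(dequeued, pg_free_at.get(pg, 0))
        end = start + duration
        thread_free_at[t] = end
        pg_free_at[pg] = end
        starts.append(start)
    return starts


EXAMPLE_OPS = [
    (0, "10.c91", 2000),     # first listomapkeys (t0)
    (500, "other.1", 10),    # fast subop on another pg
    (1000, "10.c91", 2000),  # second listomapkeys (t1)
    (1200, "other.2", 10),   # fast subops arriving between t1 and t2
    (1500, "other.3", 10),
]
```

With 2 threads, simulate(EXAMPLE_OPS, 2) gives start times
[0, 500, 2000, 2000, 2010]: the small ops arriving at 1200 and 1500 wait
until the first listomapkeys finishes even though their pgs are idle. With 3
threads the result is [0, 500, 2000, 1200, 1500], which is consistent with
extra threads helping only until enough requests for the busy pg pile up.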
>> >>>>>
>> >>>>> My guess is that a larger version of this happens during deep-scrubbing,
>> >>>>> for example on the pg containing the index for a bucket of >20M objects.
>> >>>>> A disk/op thread starts reading through the omap, which will take, say,
>> >>>>> 60 seconds. During the first seconds, requests for other pgs pass just
>> >>>>> fine. But in 60 seconds there are bound to be other requests for the
>> >>>>> same pg, especially since it holds the index file. Each of these
>> >>>>> requests blocks another disk/op thread, to the point where there are no
>> >>>>> free threads left to process any requests for any pg, causing slow
>> >>>>> requests.
>> >>>>>
>> >>>>> So first of all, thanks if you made it this far, and sorry for the
>> >>>>> involved mail; I'm exploring the problem as I go.
>> >>>>> Now, is the deep-scrubbing situation I tried to theorize even possible?
>> >>>>> If not, can you point us to where to look further?
>> >>>>> We are currently running 0.72.2 and know about the newer ioprio settings
>> >>>>> in Firefly and such. We are planning to upgrade in a few weeks, but I
>> >>>>> don't think those options will help us in any way. Am I correct?
>> >>>>> Are there any other improvements that we are not aware of?
>> >>>>
>> >>>> This is all basically correct; it's one of the reasons you don't want
>> >>>> to let individual buckets get too large.
>> >>>>
>> >>>> That said, I'm a little confused about why you're running listomapkeys
>> >>>> that way. RGW throttles itself by getting only a certain number of
>> >>>> entries at a time (1000?) and any system you're also building should
>> >>>> do the same. That would reduce the frequency of any issues, and I
>> >>>> *think* that scrubbing has some mitigating factors to help (although
>> >>>> maybe not; it's been a while since I looked at any of that stuff).
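
The throttling described above amounts to paging through the keys instead of
requesting them all in one op. A minimal Python sketch, with the caveat that
fetch_page is a hypothetical stand-in for whatever call returns at most
max_return keys sorted after start_after, and make_fake_index is just an
in-memory substitute for a real index object so the example runs on its own:

```python
import bisect


def list_keys_paged(fetch_page, page_size=1000):
    """Yield every key, asking for at most page_size keys per request."""
    start_after = ""
    while True:
        page = fetch_page(start_after, page_size)
        if not page:
            return
        yield from page
        if len(page) < page_size:
            # A short page means the end of the index has been reached.
            return
        # Resume after the last key seen; any lock is released in between.
        start_after = page[-1]


def make_fake_index(keys):
    """In-memory stand-in for an omap index (illustration only)."""
    sorted_keys = sorted(keys)

    def fetch_page(start_after, max_return):
        i = bisect.bisect_right(sorted_keys, start_after)
        return sorted_keys[i:i + max_return]

    return fetch_page
```

Each request then holds the pg for only as long as it takes to read one page,
so other ops on the same pg can be served between pages.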
>> >>>>
>> >>>> Although I just realized that my vague memory of deep scrubbing
>> >>>> working better might be based on improvements that only got in for
>> >>>> firefly...not sure.
>> >>>> -Greg
>> >>>
>> >>>
>> >>>
>> >>>
>> >>> --
>> >>> erdem agaoglu
>> >>>
>> >>> _______________________________________________
>> >>> ceph-users mailing list
>> >>> [email protected]
>> >>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>> >>>
>> >
>>
>
>
>
> --
> erdem agaoglu
>



-- 
erdem agaoglu

