Looking further, I guess what I was trying to describe was a simplified version of the sharded thread pools released in Giant. Is it possible for that to be backported to Firefly?
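To make the question concrete, here is a toy model of the difference I have in mind. It is only a sketch and not Ceph's actual code: the function names, the one-worker-per-shard simplification and the timings (loosely taken from the listomapkeys test quoted below) are all my own assumptions. It compares a single shared queue, where a worker that dequeues an op for a locked pg just sits waiting, with queues sharded by pg, where ops for one pg always land on the same worker.

```python
from collections import defaultdict

# Each op is (arrival_ms, pg, duration_ms), sorted by arrival time.
# Hypothetical workload modelled on the two overlapping listomapkeys calls.
OPS = [
    (0, 1, 3000),     # first long omap listing on pg 1
    (1000, 1, 3000),  # second long listing on the same pg
    (1500, 2, 10),    # short op on another pg
    (1600, 4, 10),    # short op on yet another pg
]


def shared_queue_latencies(ops, num_workers):
    """One FIFO queue; a worker that dequeues an op blocks on that op's pg lock."""
    worker_free = [0] * num_workers   # time at which each worker becomes idle
    pg_free = defaultdict(int)        # time at which each pg lock is released
    latencies = []
    for arrival, pg, duration in ops:
        # The earliest idle worker takes the op at the head of the queue.
        w = min(range(num_workers), key=worker_free.__getitem__)
        dequeued = max(arrival, worker_free[w])
        # The worker is tied up from here on, even while waiting for the lock.
        start = max(dequeued, pg_free[pg])
        finish = start + duration
        worker_free[w] = pg_free[pg] = finish
        latencies.append(finish - arrival)
    return latencies


def sharded_queue_latencies(ops, num_shards):
    """One queue and one worker per shard; a pg always maps to the same shard."""
    shard_free = [0] * num_shards
    latencies = []
    for arrival, pg, duration in ops:
        s = pg % num_shards           # simplistic stand-in for a pg hash
        start = max(arrival, shard_free[s])
        finish = start + duration
        shard_free[s] = finish
        latencies.append(finish - arrival)
    return latencies


if __name__ == "__main__":
    print("shared :", shared_queue_latencies(OPS, 2))
    print("sharded:", sharded_queue_latencies(OPS, 2))
```

In this model, with 2 workers the two short ops take about 1.5 s each behind the shared queue, and 10 ms each with 2 shards. Sharding does not remove the wait for ops that hash to the same shard as the long listing; it only stops one busy pg from tying up every thread.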
On Tue, Mar 3, 2015 at 9:33 AM, Erdem Agaoglu <[email protected]> wrote:
> Thank you folks for bringing that up. I had some questions about sharding.
> We'd like blind buckets too, at least it's on the roadmap. For the current
> sharded implementation, what are the final details? Is number of shards
> defined per bucket or globally? Is there a way to split current indexes
> into shards?
>
> On the other hand what i'd like to point here is not necessarily
> large-bucket-index specific. The problem is the mechanism around thread
> pools. Any request may require locks on a pg and this should not block the
> requests for other pgs. I'm no expert but the threads may be able to
> requeue the requests to a locked pg, processing others for other pgs. Or
> maybe a thread per pg design was possible. Because, you know, it is
> somewhat OK not being able to do anything for a locked resource. Then you
> can go and improve your processing or your locks. But it's a whole
> different problem when a locked pg blocks requests for a few hundred other
> pgs in other pools for no good reason.
>
> On Tue, Mar 3, 2015 at 5:43 AM, Ben Hines <[email protected]> wrote:
>
>> Blind-bucket would be perfect for us, as we don't need to list the
>> objects.
>>
>> We only need to list the bucket when doing a bucket deletion. If we
>> could clean out/delete all objects in a bucket (without
>> iterating/listing them) that would be ideal..
>>
>> On Mon, Mar 2, 2015 at 7:34 PM, GuangYang <[email protected]> wrote:
>> > We have had good experience so far keeping each bucket less than 0.5
>> > million objects, by client side sharding. But I think it would be nice
>> > you can test at your scale, with your hardware configuration, as well
>> > as your expectation over the tail latency.
>> >
>> > Generally the bucket sharding should help, both for Write throughput
>> > and *stall with recovering/scrubbing*, but it comes with a price - The X
>> > shards you have for each bucket, the listing/trimming would be X times
>> > weighted, from OSD's load's point of view. There was discussion to
>> > implement: 1) blind bucket (for use cases bucket listing is not needed).
>> > 2) Un-ordered listing, which could improve the problem I mentioned above.
>> > They are on the roadmap...
>> >
>> > Thanks,
>> > Guang
>> >
>> >
>> > ----------------------------------------
>> >> From: [email protected]
>> >> Date: Mon, 2 Mar 2015 18:13:25 -0800
>> >> To: [email protected]
>> >> CC: [email protected]
>> >> Subject: Re: [ceph-users] Some long running ops may lock osd
>> >>
>> >> We're seeing a lot of this as well. (as i mentioned to sage at
>> >> SCALE..) Is there a rule of thumb at all for how big is safe to let a
>> >> RGW bucket get?
>> >>
>> >> Also, is this theoretically resolved by the new bucket-sharding
>> >> feature in the latest dev release?
>> >>
>> >> -Ben
>> >>
>> >> On Mon, Mar 2, 2015 at 11:08 AM, Erdem Agaoglu <[email protected]> wrote:
>> >>> Hi Gregory,
>> >>>
>> >>> We are not using listomapkeys that way or in any way to be precise. I
>> >>> used it here just to reproduce the behavior/issue.
>> >>>
>> >>> What i am really interested in is if scrubbing-deep actually mitigates
>> >>> the problem and/or is there something that can be further improved.
>> >>>
>> >>> Or i guess we should go upgrade now and hope for the best :)
>> >>>
>> >>> On Mon, Mar 2, 2015 at 8:10 PM, Gregory Farnum <[email protected]> wrote:
>> >>>>
>> >>>> On Mon, Mar 2, 2015 at 7:56 AM, Erdem Agaoglu <[email protected]>
>> >>>> wrote:
>> >>>>> Hi all, especially devs,
>> >>>>>
>> >>>>> We have recently pinpointed one of the causes of slow requests in our
>> >>>>> cluster.
>> >>>>> It seems deep-scrubs on pg's that contain the index file for a
>> >>>>> large radosgw bucket lock the osds. Increasing op threads and/or disk
>> >>>>> threads helps a little bit, but we need to increase them beyond reason
>> >>>>> in order to completely get rid of the problem. A somewhat similar (and
>> >>>>> more severe) version of the issue occurs when we call listomapkeys for
>> >>>>> the index file, and since the logs for deep-scrubbing were much harder
>> >>>>> to read, this inspection was based on listomapkeys.
>> >>>>>
>> >>>>> In this example osd.121 is the primary of pg 10.c91 which contains file
>> >>>>> .dir.5926.3 in .rgw.buckets pool. OSD has 2 op threads. Bucket contains
>> >>>>> ~500k objects. Standard listomapkeys call takes about 3 seconds.
>> >>>>>
>> >>>>> time rados -p .rgw.buckets listomapkeys .dir.5926.3 > /dev/null
>> >>>>> real 0m2.983s
>> >>>>> user 0m0.760s
>> >>>>> sys 0m0.148s
>> >>>>>
>> >>>>> In order to lock the osd we request 2 of them simultaneously with
>> >>>>> something like:
>> >>>>>
>> >>>>> rados -p .rgw.buckets listomapkeys .dir.5926.3 > /dev/null &
>> >>>>> sleep 1
>> >>>>> rados -p .rgw.buckets listomapkeys .dir.5926.3 > /dev/null &
>> >>>>>
>> >>>>> 'debug_osd=30' logs show the flow like:
>> >>>>>
>> >>>>> At t0 some thread enqueue_op's my omap-get-keys request.
>> >>>>> Op-Thread A locks pg 10.c91 and dequeue_op's it and starts reading
>> >>>>> ~500k keys.
>> >>>>> Op-Thread B responds to several other requests during that 1 second
>> >>>>> sleep. They're generally extremely fast subops on other pgs.
>> >>>>> At t1 (about a second later) my second omap-get-keys request gets
>> >>>>> enqueue_op'ed. But it does not start probably because of the lock held
>> >>>>> by Thread A.
>> >>>>> After that point other threads enqueue_op other requests on other pgs
>> >>>>> too but none of them starts processing, in which i consider the osd is
>> >>>>> locked.
>> >>>>> At t2 (about another second later) my first omap-get-keys request is
>> >>>>> finished.
>> >>>>> Op-Thread B locks pg 10.c91 and dequeue_op's my second request and
>> >>>>> starts reading ~500k keys again.
>> >>>>> Op-Thread A continues to process the requests enqueued in t1-t2.
>> >>>>>
>> >>>>> It seems Op-Thread B is waiting on the lock held by Op-Thread A while
>> >>>>> it can process other requests for other pg's just fine.
>> >>>>>
>> >>>>> My guess is a somewhat larger scenario happens in deep-scrubbing, like
>> >>>>> on the pg containing index for the bucket of >20M objects. A disk/op
>> >>>>> thread starts reading through the omap which will take say 60 seconds.
>> >>>>> During the first seconds, other requests for other pgs pass just fine.
>> >>>>> But in 60 seconds there are bound to be other requests for the same pg,
>> >>>>> especially since it holds the index file. Each of these requests lock
>> >>>>> another disk/op thread to the point where there are no free threads
>> >>>>> left to process any requests for any pg. Causing slow-requests.
>> >>>>>
>> >>>>> So first of all thanks if you can make it here, and sorry for the
>> >>>>> involved mail, i'm exploring the problem as i go.
>> >>>>> Now, is that deep-scrubbing situation i tried to theorize even
>> >>>>> possible? If not can you point us where to look further.
>> >>>>> We are currently running 0.72.2 and know about newer ioprio settings in
>> >>>>> Firefly and such. While we are planning to upgrade in a few weeks but i
>> >>>>> don't think those options will help us in any way. Am i correct?
>> >>>>> Are there any other improvements that we are not aware?
>> >>>>
>> >>>> This is all basically correct; it's one of the reasons you don't want
>> >>>> to let individual buckets get too large.
>> >>>>
>> >>>> That said, I'm a little confused about why you're running listomapkeys
>> >>>> that way. RGW throttles itself by getting only a certain number of
>> >>>> entries at a time (1000?) and any system you're also building should
>> >>>> do the same. That would reduce the frequency of any issues, and I
>> >>>> *think* that scrubbing has some mitigating factors to help (although
>> >>>> maybe not; it's been a while since I looked at any of that stuff).
>> >>>>
>> >>>> Although I just realized that my vague memory of deep scrubbing
>> >>>> working better might be based on improvements that only got in for
>> >>>> firefly...not sure.
>> >>>> -Greg
>> >>>
>> >>>
>> >>> --
>> >>> erdem agaoglu
>> >>>
>> >>> _______________________________________________
>> >>> ceph-users mailing list
>> >>> [email protected]
>> >>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>> >>>
>> >
>
> --
> erdem agaoglu

--
erdem agaoglu
