We have had good experience so far keeping each bucket under 0.5 million
objects by sharding on the client side. That said, I think it would be
worthwhile to test at your own scale, with your hardware configuration and
your expectations for tail latency.
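
For what it's worth, below is a minimal sketch of what I mean by client-side
sharding. The key, bucket names and shard count are made up for illustration;
the idea is simply that the application hashes the object key and spreads
objects over several smaller buckets:

# bash sketch: pick 1 of 16 bucket shards by hashing the object key
key="images/2015/03/02/photo-123.jpg"
shard=$(( 0x$(printf '%s' "$key" | md5sum | cut -c1-2) % 16 ))    # 0..15
bucket="mybucket-shard-${shard}"
# all reads and writes for this key then go to $bucket instead of one big bucket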

Generally bucket sharding should help, both with write throughput and with the
*stalls during recovery/scrubbing*, but it comes with a price: with X shards
per bucket, listing/trimming becomes roughly X times as expensive from the
OSDs' point of view. There has been discussion about implementing: 1) blind
buckets (for use cases where bucket listing is not needed), and 2) unordered
listing, which could mitigate the problem I mentioned above. Both are on the
roadmap...
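
If you do try the new index sharding, if I remember correctly it is controlled
by an rgw option along the lines of the snippet below. Please verify the exact
option name against the release you run (the section name here is just an
example from a typical setup), and note that as far as I know it only applies
to buckets created after the setting is in place:

[client.radosgw.gateway]
    rgw override bucket index max shards = 8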

Thanks,
Guang


----------------------------------------
> From: bhi...@gmail.com
> Date: Mon, 2 Mar 2015 18:13:25 -0800
> To: erdem.agao...@gmail.com
> CC: ceph-users@lists.ceph.com
> Subject: Re: [ceph-users] Some long running ops may lock osd
>
> We're seeing a lot of this as well (as I mentioned to Sage at SCALE).
> Is there a rule of thumb at all for how big it is safe to let an RGW
> bucket get?
>
> Also, is this theoretically resolved by the new bucket-sharding
> feature in the latest dev release?
>
> -Ben
>
> On Mon, Mar 2, 2015 at 11:08 AM, Erdem Agaoglu <erdem.agao...@gmail.com> 
> wrote:
>> Hi Gregory,
>>
>> We are not using listomapkeys that way, or in any way to be precise. I used
>> it here just to reproduce the behavior/issue.
>>
>> What I am really interested in is whether deep scrubbing actually mitigates
>> the problem and/or whether there is something that can be further improved.
>>
>> Or I guess we should go upgrade now and hope for the best :)
>>
>> On Mon, Mar 2, 2015 at 8:10 PM, Gregory Farnum <g...@gregs42.com> wrote:
>>>
>>> On Mon, Mar 2, 2015 at 7:56 AM, Erdem Agaoglu <erdem.agao...@gmail.com>
>>> wrote:
>>>> Hi all, especially devs,
>>>>
>>>> We have recently pinpointed one of the causes of slow requests in our
>>>> cluster. It seems deep-scrubs on pgs that contain the index file for a
>>>> large radosgw bucket lock the osds. Increasing op threads and/or disk
>>>> threads helps a little bit, but we would need to increase them beyond
>>>> reason in order to completely get rid of the problem. A somewhat similar
>>>> (and more severe) version of the issue occurs when we call listomapkeys
>>>> for the index file, and since the logs for deep-scrubbing were much
>>>> harder to read, this inspection was based on listomapkeys.
>>>>
>>>> In this example osd.121 is the primary of pg 10.c91, which contains the
>>>> file .dir.5926.3 in the .rgw.buckets pool. The OSD has 2 op threads and
>>>> the bucket contains ~500k objects. A standard listomapkeys call takes
>>>> about 3 seconds:
>>>>
>>>> time rados -p .rgw.buckets listomapkeys .dir.5926.3 > /dev/null
>>>> real 0m2.983s
>>>> user 0m0.760s
>>>> sys 0m0.148s
>>>>
>>>> In order to lock the osd we request 2 of them simultaneously with
>>>> something like:
>>>>
>>>> rados -p .rgw.buckets listomapkeys .dir.5926.3 > /dev/null &
>>>> sleep 1
>>>> rados -p .rgw.buckets listomapkeys .dir.5926.3 > /dev/null &
>>>>
>>>> 'debug_osd=30' logs show the flow like this:
>>>>
>>>> At t0 some thread enqueue_op's my omap-get-keys request.
>>>> Op-Thread A locks pg 10.c91, dequeue_op's the request and starts reading
>>>> ~500k keys.
>>>> Op-Thread B responds to several other requests during that 1 second sleep.
>>>> They're generally extremely fast subops on other pgs.
>>>> At t1 (about a second later) my second omap-get-keys request gets
>>>> enqueue_op'ed, but it does not start, probably because of the lock held by
>>>> Thread A.
>>>> After that point other threads enqueue_op further requests on other pgs,
>>>> but none of them start processing, which is when I consider the osd
>>>> locked.
>>>> At t2 (about another second later) my first omap-get-keys request
>>>> finishes.
>>>> Op-Thread B locks pg 10.c91, dequeue_op's my second request and starts
>>>> reading ~500k keys again.
>>>> Op-Thread A continues to process the requests enqueued between t1 and t2.
>>>>
>>>> It seems Op-Thread B was waiting on the lock held by Op-Thread A even
>>>> though it could have been processing other requests for other pgs just
>>>> fine.
>>>>
>>>> My guess is that a somewhat larger version of this scenario happens during
>>>> deep-scrubbing, e.g. on the pg containing the index for a bucket of >20M
>>>> objects. A disk/op thread starts reading through the omap, which will take
>>>> say 60 seconds. During the first seconds, other requests for other pgs
>>>> pass just fine. But within those 60 seconds there are bound to be other
>>>> requests for the same pg, especially since it holds the index file. Each
>>>> of these requests locks up another disk/op thread, to the point where
>>>> there are no free threads left to process requests for any pg, causing
>>>> slow requests.
>>>>
>>>> So first of all, thanks if you have made it this far, and sorry for the
>>>> involved mail; I'm exploring the problem as I go.
>>>> Now, is the deep-scrubbing situation I tried to theorize even possible? If
>>>> not, can you point us to where to look further?
>>>> We are currently running 0.72.2 and know about the newer ioprio settings
>>>> in Firefly and such. We are planning to upgrade in a few weeks, but I
>>>> don't think those options will help us in any way. Am I correct?
>>>> Are there any other improvements that we are not aware of?
>>>
>>> This is all basically correct; it's one of the reasons you don't want
>>> to let individual buckets get too large.
>>>
>>> That said, I'm a little confused about why you're running listomapkeys
>>> that way. RGW throttles itself by getting only a certain number of
>>> entries at a time (1000?) and any system you're also building should
>>> do the same. That would reduce the frequency of any issues, and I
>>> *think* that scrubbing has some mitigating factors to help (although
>>> maybe not; it's been a while since I looked at any of that stuff).
>>>
>>> Although I just realized that my vague memory of deep scrubbing
>>> working better might be based on improvements that only got in for
>>> firefly...not sure.
>>> -Greg
>>
>>
>>
>>
>> --
>> erdem agaoglu
>>
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
