Hi Gregory,

We are not using listomapkeys that way, or in any way, to be precise. I used
it here just to reproduce the behavior/issue.

What I am really interested in is whether deep-scrubbing actually mitigates
the problem, and/or whether there is something that can be further improved.

Or I guess we should just upgrade now and hope for the best :)

On Mon, Mar 2, 2015 at 8:10 PM, Gregory Farnum <[email protected]> wrote:

> On Mon, Mar 2, 2015 at 7:56 AM, Erdem Agaoglu <[email protected]>
> wrote:
> > Hi all, especially devs,
> >
> > We have recently pinpointed one of the causes of slow requests in our
> > cluster. It seems deep-scrubs on pgs that contain the index file for a
> > large radosgw bucket lock the osds. Increasing op threads and/or disk
> > threads helps a little bit, but we need to increase them beyond reason
> > in order to completely get rid of the problem. A somewhat similar (and
> > more severe) version of the issue occurs when we call listomapkeys for
> > the index file, and since the logs for deep-scrubbing were much harder
> > to read, this inspection was based on listomapkeys.
> >
> > In this example osd.121 is the primary of pg 10.c91, which contains the
> > file .dir.5926.3 in the .rgw.buckets pool. The OSD has 2 op threads. The
> > bucket contains ~500k objects. A standard listomapkeys call takes about
> > 3 seconds.
> >
> > time rados -p .rgw.buckets listomapkeys .dir.5926.3 > /dev/null
> > real 0m2.983s
> > user 0m0.760s
> > sys 0m0.148s
> >
> > In order to lock the osd we request 2 of them simultaneously with
> > something like:
> >
> > rados -p .rgw.buckets listomapkeys .dir.5926.3 > /dev/null &
> > sleep 1
> > rados -p .rgw.buckets listomapkeys .dir.5926.3 > /dev/null &
> >
> > 'debug_osd=30' logs show the flow like:
> >
> > At t0 some thread enqueue_op's my omap-get-keys request.
> > Op-Thread A locks pg 10.c91 and dequeue_op's it and starts reading ~500k
> > keys.
> > Op-Thread B responds to several other requests during that 1 second
> > sleep. They're generally extremely fast subops on other pgs.
> > At t1 (about a second later) my second omap-get-keys request gets
> > enqueue_op'ed. But it does not start, probably because of the lock held
> > by Thread A.
> > After that point other threads enqueue_op other requests on other pgs
> > too, but none of them starts processing, at which point I consider the
> > osd locked.
> > At t2 (about another second later) my first omap-get-keys request is
> > finished.
> > Op-Thread B locks pg 10.c91 and dequeue_op's my second request and starts
> > reading ~500k keys again.
> > Op-Thread A continues to process the requests enqueued in t1-t2.
> >
> > It seems Op-Thread B is waiting on the lock held by Op-Thread A, although
> > it could process requests for other pgs just fine.
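
A toy model of the timeline above, in case it helps to reason about thread
counts. This is only a sketch of the hypothesis, not how the OSD is actually
implemented: it assumes a single FIFO queue, and that a worker which dequeues
an op stays tied up while it waits for that op's pg lock. All pg names other
than 10.c91 and all durations are made-up, rounded values.

```python
def simulate(ops, n_threads):
    """Toy FIFO model: a worker dequeues an op, then blocks on its pg lock.

    ops: list of (arrival_ms, pg, duration_ms), sorted by arrival time.
    Returns one (pg, arrival_ms, start_ms, finish_ms) tuple per op.
    """
    thread_free = [0] * n_threads  # time at which each worker is next idle
    pg_free = {}                   # time at which each pg lock is released
    timeline = []
    for arrival, pg, duration in ops:
        # The head of the queue goes to whichever worker becomes idle first.
        worker = min(range(n_threads), key=thread_free.__getitem__)
        dequeued = max(arrival, thread_free[worker])
        # Assumption: the worker now waits for the pg lock and does nothing else.
        start = max(dequeued, pg_free.get(pg, 0))
        finish = start + duration
        thread_free[worker] = finish
        pg_free[pg] = finish
        timeline.append((pg, arrival, start, finish))
    return timeline


def delays_on_other_pgs(ops, n_threads, hot_pg="10.c91"):
    """Queueing delay (ms) of every op that does not touch the hot pg."""
    return [start - arrival
            for pg, arrival, start, _ in simulate(ops, n_threads)
            if pg != hot_pg]


# Two long omap reads on 10.c91, plus short ops on hypothetical other pgs.
OPS = [
    (0,    "10.c91", 2000),  # first listomapkeys (t0)
    (500,  "10.a01", 10),    # fast subop during the sleep
    (1000, "10.c91", 2000),  # second listomapkeys (t1)
    (1200, "10.b02", 10),    # fast subops arriving between t1 and t2
    (1500, "10.d03", 10),
]

print(delays_on_other_pgs(OPS, 2))  # [0, 800, 510]
print(delays_on_other_pgs(OPS, 3))  # [0, 0, 0]
```

With 2 workers the short ops that arrive after t1 wait until the first read
finishes at t2; with 3 workers they are not delayed at all in this model.
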
> >
> > My guess is that a somewhat larger version of this scenario happens in
> > deep-scrubbing, e.g. on the pg containing the index for a bucket of >20M
> > objects. A disk/op thread starts reading through the omap, which will
> > take say 60 seconds. During the first seconds, other requests for other
> > pgs pass just fine. But in 60 seconds there are bound to be other
> > requests for the same pg, especially since it holds the index file. Each
> > of these requests locks another disk/op thread, to the point where there
> > are no free threads left to process any requests for any pg, causing
> > slow requests.
> >
> > So first of all, thanks if you made it this far, and sorry for the
> > involved mail; I'm exploring the problem as I go.
> > Now, is the deep-scrubbing situation I tried to theorize even possible?
> > If not, can you point us to where to look further?
> > We are currently running 0.72.2 and know about the newer ioprio settings
> > in Firefly and such. We are planning to upgrade in a few weeks, but I
> > don't think those options will help us in any way. Am I correct?
> > Are there any other improvements that we are not aware of?
>
> This is all basically correct; it's one of the reasons you don't want
> to let individual buckets get too large.
>
> That said, I'm a little confused about why you're running listomapkeys
> that way. RGW throttles itself by getting only a certain number of
> entries at a time (1000?) and any system you're also building should
> do the same. That would reduce the frequency of any issues, and I
> *think* that scrubbing has some mitigating factors to help (although
> maybe not; it's been a while since I looked at any of that stuff).
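
For reference, the "only a certain number of entries at a time" idea can be
sketched roughly as below. This is a generic paging loop, not RGW's code and
not the librados API: fetch_page and make_fake_index are hypothetical
stand-ins backed by an in-memory list, and the chunk size of 1000 is just the
figure guessed at above.

```python
from bisect import bisect_right


def list_keys_in_chunks(fetch_page, max_return=1000):
    """Yield every key, requesting at most max_return keys per call."""
    start_after = ""
    while True:
        page = fetch_page(start_after, max_return)
        if not page:
            return
        yield from page
        if len(page) < max_return:
            return  # a short page means the end was reached
        start_after = page[-1]  # resume after the last key seen


def make_fake_index(keys):
    """Hypothetical in-memory stand-in for a bucket index omap.

    Returns (fetch_page, calls), where calls records the start_after
    marker of every request so the number of round trips can be checked.
    """
    ordered = sorted(keys)
    calls = []

    def fetch_page(start_after, max_return):
        calls.append(start_after)
        lo = bisect_right(ordered, start_after)
        return ordered[lo:lo + max_return]

    return fetch_page, calls


fetch_page, calls = make_fake_index(f"obj{i:06d}" for i in range(2500))
keys = list(list_keys_in_chunks(fetch_page, max_return=1000))
print(len(keys), len(calls))  # 2500 3
```

Each request is short, so in principle the pg lock is released between
chunks and other ops get a chance to run in between.
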
>
> Although I just realized that my vague memory of deep scrubbing
> working better might be based on improvements that only got in for
> firefly...not sure.
> -Greg
>



-- 
erdem agaoglu
