Re: [ceph-users] Some long running ops may lock osd

2015-03-03 Thread Erdem Agaoglu
Looking further, I guess what I was trying to describe is a simplified
version of the sharded threadpools released in Giant. Is it possible for
that to be backported to Firefly?
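
(For illustration only: a toy sketch of the sharded-threadpool idea, not
Ceph's actual implementation. Ops are hashed by pg into a fixed number of
shards, each with its own queue and worker threads, so a slow op can only
hold up the pgs that map to its own shard. The shard and worker counts
below are arbitrary assumptions.)

import queue
import threading

NUM_SHARDS = 5           # arbitrary, for illustration
WORKERS_PER_SHARD = 2    # arbitrary, for illustration

class ShardedOpQueue:
    def __init__(self):
        # one queue plus its own workers per shard
        self.queues = [queue.Queue() for _ in range(NUM_SHARDS)]
        for q in self.queues:
            for _ in range(WORKERS_PER_SHARD):
                threading.Thread(target=self._worker, args=(q,),
                                 daemon=True).start()

    def submit(self, pg_id, op):
        # all ops for a given pg land in the same shard, keeping per-pg order
        self.queues[hash(pg_id) % NUM_SHARDS].put(op)

    def _worker(self, q):
        while True:
            op = q.get()   # a long op here stalls only this shard's pgs
            op()
            q.task_done()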

Re: [ceph-users] Some long running ops may lock osd

2015-03-02 Thread Ben Hines
Blind-bucket would be perfect for us, as we don't need to list the objects.

We only need to list the bucket when doing a bucket deletion. If we
could clean out/delete all objects in a bucket (without
iterating/listing them), that would be ideal.


Re: [ceph-users] Some long running ops may lock osd

2015-03-02 Thread GuangYang
We have had good experience so far keeping each bucket under 0.5 million
objects, by client-side sharding. But I think it would be nice if you could
test at your scale, with your hardware configuration, as well as your
expectations for tail latency.

Generally the bucket sharding should help, both for write throughput and
*stalls during recovering/scrubbing*, but it comes with a price: with X
shards per bucket, listing/trimming becomes X times as heavy in terms of
OSD load. There has been discussion of implementing: 1) blind buckets (for
use cases where bucket listing is not needed), and 2) unordered listing,
which could improve the problem I mentioned above. They are on the
roadmap...

Thanks,
Guang
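
(A minimal sketch of what client-side sharding like this can look like; the
shard count, bucket naming scheme and hash choice are assumptions for
illustration, not Guang's actual setup.)

import hashlib

NUM_BUCKET_SHARDS = 64   # chosen so objects per bucket stays well below ~500k

def shard_bucket(base_bucket, object_key):
    # deterministically map an object key to one shard bucket, e.g.
    # "mydata-17", so no single bucket index grows unbounded
    digest = hashlib.md5(object_key.encode("utf-8")).hexdigest()
    return "%s-%d" % (base_bucket, int(digest, 16) % NUM_BUCKET_SHARDS)

# writes and reads both go through the same mapping:
#   put(shard_bucket("mydata", key), key, data)
#   get(shard_bucket("mydata", key), key)
# listing "everything" now means listing all 64 shard buckets, which is the
# X-times-heavier listing cost mentioned above.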




Re: [ceph-users] Some long running ops may lock osd

2015-03-02 Thread Ben Hines
We're seeing a lot of this as well (as I mentioned to Sage at SCALE).
Is there a rule of thumb at all for how big it is safe to let an RGW
bucket get?

Also, is this theoretically resolved by the new bucket-sharding
feature in the latest dev release?

-Ben


Re: [ceph-users] Some long running ops may lock osd

2015-03-02 Thread Erdem Agaoglu
Thank you folks for bringing that up. I had some questions about sharding.
We'd like blind buckets too; at least it's on the roadmap. For the current
sharded implementation, what are the final details? Is the number of shards
defined per bucket or globally? Is there a way to split current indexes
into shards?

On the other hand, what I'd like to point out here is not necessarily
large-bucket-index specific. The problem is the mechanism around thread
pools. Any request may require locks on a pg, and this should not block
requests for other pgs. I'm no expert, but the threads might be able to
requeue requests for a locked pg and keep processing others for other pgs.
Or maybe a thread-per-pg design would be possible. Because, you know, it is
somewhat OK not to be able to do anything for a locked resource; then you
can go and improve your processing or your locks. But it's a whole different
problem when a locked pg blocks requests for a few hundred other pgs in
other pools for no good reason.
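
(A toy sketch of the requeue idea, nothing like the real OSD code: a worker
that cannot take a pg's lock puts the op back on the queue and serves other
pgs instead of parking on the lock. A real implementation would also have to
preserve per-pg ordering and avoid busy-spinning on a queue full of ops for
the locked pg.)

import collections
import threading

pg_locks = collections.defaultdict(threading.Lock)
op_queue = collections.deque()             # items are (pg_id, op) pairs

def worker():
    while True:
        try:
            pg_id, op = op_queue.popleft()
        except IndexError:
            break                          # nothing left to do
        lock = pg_locks[pg_id]
        if lock.acquire(blocking=False):   # don't park the thread on a busy pg
            try:
                op()
            finally:
                lock.release()
        else:
            op_queue.append((pg_id, op))   # retry later; serve other pgs now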

[ceph-users] Some long running ops may lock osd

2015-03-02 Thread Erdem Agaoglu
Hi all, especially devs,

We have recently pinpointed one of the causes of slow requests in our
cluster. It seems deep-scrubs on pgs that contain the index file for a
large radosgw bucket lock up the osds. Increasing op threads and/or disk
threads helps a little bit, but we need to increase them beyond reason in
order to completely get rid of the problem. A somewhat similar (and more
severe) version of the issue occurs when we call listomapkeys for the index
file, and since the logs for deep-scrubbing were much harder to read, this
inspection was based on listomapkeys.

In this example osd.121 is the primary of pg 10.c91, which contains the file
.dir.5926.3 in the .rgw.buckets pool. The OSD has 2 op threads. The bucket
contains ~500k objects. A standard listomapkeys call takes about 3 seconds.

time rados -p .rgw.buckets listomapkeys .dir.5926.3 > /dev/null
real 0m2.983s
user 0m0.760s
sys 0m0.148s

In order to lock the osd we request 2 of them simultaneously with something
like:

rados -p .rgw.buckets listomapkeys .dir.5926.3 > /dev/null &
sleep 1
rados -p .rgw.buckets listomapkeys .dir.5926.3 > /dev/null &

'debug_osd=30' logs show the flow like this:

At t0 some thread enqueue_op's my omap-get-keys request.
Op-Thread A locks pg 10.c91, dequeue_op's it, and starts reading ~500k
keys.
Op-Thread B responds to several other requests during that 1-second sleep.
They're generally extremely fast subops on other pgs.
At t1 (about a second later) my second omap-get-keys request gets
enqueue_op'ed. But it does not start, probably because of the lock held by
Thread A.
After that point other threads enqueue_op other requests on other pgs too,
but none of them starts processing, which is the point at which I consider
the osd locked.
At t2 (about another second later) my first omap-get-keys request is
finished.
Op-Thread B locks pg 10.c91, dequeue_op's my second request, and starts
reading ~500k keys again.
Op-Thread A continues to process the requests enqueued between t1 and t2.

It seems Op-Thread B is waiting on the lock held by Op-Thread A even though
it could process requests for other pgs just fine.

My guess is that a somewhat larger version of this scenario happens in
deep-scrubbing, e.g. on the pg containing the index for a bucket of 20M
objects. A disk/op thread starts reading through the omap, which will take,
say, 60 seconds. During the first seconds, requests for other pgs pass just
fine. But over 60 seconds there are bound to be other requests for the same
pg, especially since it holds the index file. Each of these requests ties up
another disk/op thread, to the point where there are no free threads left to
process requests for any pg, causing slow requests.
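
(A back-of-envelope version of that scenario; every number below is an
assumption for illustration, not a measurement from our cluster.)

op_threads = 8        # size of the op/disk threadpool
scrub_seconds = 60    # how long the omap read keeps the index pg locked
hot_pg_interval = 5   # a new request hits the index pg every ~5 seconds

# each request for the locked pg parks one more thread on its lock
arrivals_during_scrub = scrub_seconds // hot_pg_interval   # 12 requests
stall_after = op_threads * hot_pg_interval                 # 40 seconds

if arrivals_during_scrub >= op_threads:
    print("all %d threads parked after ~%ds; every pg on this osd now waits"
          % (op_threads, stall_after))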

So first of all, thanks if you have made it this far, and sorry for the
involved mail; I'm exploring the problem as I go.
Now, is the deep-scrubbing situation I tried to theorize about even
possible? If not, can you point us to where to look further?
We are currently running 0.72.2 and know about the newer ioprio settings in
Firefly and such. We are planning to upgrade in a few weeks, but I don't
think those options will help us in any way. Am I correct?
Are there any other improvements that we are not aware of?

Regards,


-- 
erdem agaoglu


Re: [ceph-users] Some long running ops may lock osd

2015-03-02 Thread Gregory Farnum

This is all basically correct; it's one of the reasons you don't want
to let individual buckets get too large.

That said, I'm a little confused about why you're running listomapkeys
that way. RGW throttles itself by getting only a certain number of
entries at a time (1000?) and any system you're also building should
do the same. That would reduce the frequency of any issues, and I
*think* that scrubbing has some mitigating factors to help (although
maybe not; it's been a while since I looked at any of that stuff).
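
(For completeness, a sketch of what paginated omap listing can look like
from a client, in the same spirit as RGW's own throttling. It assumes a
python-rados build that exposes the omap read-op bindings; the exact
signatures have varied between releases. Pool and object names are taken
from the example earlier in the thread.)

import rados

BATCH = 1000   # keys per round trip keeps each individual op on the OSD short

cluster = rados.Rados(conffile="/etc/ceph/ceph.conf")
cluster.connect()
ioctx = cluster.open_ioctx(".rgw.buckets")

start_after = ""
total = 0
while True:
    with rados.ReadOpCtx() as read_op:
        it, _ = ioctx.get_omap_vals(read_op, start_after, "", BATCH)
        ioctx.operate_read_op(read_op, ".dir.5926.3")
        keys = [k for k, _ in it]
    if not keys:
        break
    total += len(keys)
    start_after = keys[-1]   # resume where the previous batch ended

print("%d omap keys listed in batches of %d" % (total, BATCH))
ioctx.close()
cluster.shutdown()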

Although I just realized that my vague memory of deep scrubbing
working better might be based on improvements that only got in for
firefly...not sure.
-Greg


Re: [ceph-users] Some long running ops may lock osd

2015-03-02 Thread Erdem Agaoglu
Hi Gregory,

We are not using listomapkeys that way, or in any way to be precise; I used
it here just to reproduce the behavior/issue.

What I am really interested in is whether deep-scrubbing actually mitigates
the problem and/or whether there is something that can be further improved.

Or I guess we should just go upgrade now and hope for the best :)


-- 
erdem agaoglu
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com