Re: [ceph-users] Some long running ops may lock osd
Looking further, I guess what I tried to describe was a simplified version of the sharded thread pools released in Giant. Is it possible for that to be backported to Firefly?

On Tue, Mar 3, 2015 at 9:33 AM, Erdem Agaoglu erdem.agao...@gmail.com wrote:
[...]
Re: [ceph-users] Some long running ops may lock osd
Blind-bucket would be perfect for us, as we don't need to list the objects. We only need to list the bucket when doing a bucket deletion. If we could clean out/delete all objects in a bucket (without iterating/listing them) that would be ideal.

On Mon, Mar 2, 2015 at 7:34 PM, GuangYang yguan...@outlook.com wrote:
[...]
Re: [ceph-users] Some long running ops may lock osd
We have had good experience so far keeping each bucket to less than 0.5 million objects, by client-side sharding. But I think it would be nice if you could test at your scale, with your hardware configuration, as well as your expectation of the tail latency.

Generally the bucket sharding should help, both for write throughput and for *stalls during recovery/scrubbing*, but it comes with a price: with X shards for each bucket, the listing/trimming becomes X times as heavy from the OSD load's point of view. There was discussion to implement: 1) blind buckets (for use cases where bucket listing is not needed); 2) unordered listing, which could improve the problem I mentioned above. They are on the roadmap...

Thanks,
Guang

From: bhi...@gmail.com
Date: Mon, 2 Mar 2015 18:13:25 -0800
To: erdem.agao...@gmail.com
CC: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] Some long running ops may lock osd
[...]
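Client-side sharding of the kind Guang describes can be sketched in a few lines. This is an illustrative sketch only, not code from this thread: the bucket-naming scheme, shard count, and helper names are invented for the example. The idea is to hash each object key to one of N fixed buckets so no single bucket index grows past the ~0.5M-object comfort zone, at the cost Guang mentions: a full listing must now touch every shard.

```python
import hashlib

NUM_SHARDS = 64  # pick so each shard stays well under ~0.5M objects

def shard_bucket(base_bucket, object_key):
    """Map an object key to one of NUM_SHARDS fixed buckets.

    The hash must be stable so that reads find the object in the same
    shard the write used."""
    h = int(hashlib.md5(object_key.encode()).hexdigest(), 16)
    return "%s-shard-%d" % (base_bucket, h % NUM_SHARDS)

def list_all(base_bucket, list_bucket):
    """The trade-off: a full listing now merges every shard's index."""
    for i in range(NUM_SHARDS):
        yield from list_bucket("%s-shard-%d" % (base_bucket, i))

# Writes spread across shards; the same key always lands in the same shard.
print(shard_bucket("photos", "2015/03/02/cat.jpg"))
```

Trimming and bucket stats need the same fan-out treatment, which is why the per-shard index load is X times weighted for those operations.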
Re: [ceph-users] Some long running ops may lock osd
We're seeing a lot of this as well (as I mentioned to Sage at SCALE). Is there a rule of thumb at all for how big it is safe to let an RGW bucket get? Also, is this theoretically resolved by the new bucket-sharding feature in the latest dev release?

-Ben

On Mon, Mar 2, 2015 at 11:08 AM, Erdem Agaoglu erdem.agao...@gmail.com wrote:
[...]
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Some long running ops may lock osd
Thank you folks for bringing that up. I had some questions about sharding. We'd like blind buckets too; at least it's on the roadmap. For the current sharded implementation, what are the final details? Is the number of shards defined per bucket or globally? Is there a way to split current indexes into shards?

On the other hand, what I'd like to point out here is not necessarily large-bucket-index specific. The problem is the mechanism around thread pools. Any request may require locks on a pg, and this should not block the requests for other pgs. I'm no expert, but the threads may be able to requeue the requests to a locked pg, processing others for other pgs. Or maybe a thread-per-pg design is possible. Because, you know, it is somewhat OK not being able to do anything for a locked resource. Then you can go and improve your processing or your locks. But it's a whole different problem when a locked pg blocks requests for a few hundred other pgs in other pools for no good reason.

On Tue, Mar 3, 2015 at 5:43 AM, Ben Hines bhi...@gmail.com wrote:
[...]
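The requeue idea above can be illustrated with a toy scheduling loop. This is a sketch only, not how the OSD's actual op work queue is written, and the names are made up: instead of blocking on a busy pg's lock, the thread tries the lock non-blockingly, sets the op aside on failure, and keeps serving other pgs.

```python
import threading
from collections import deque

def drain(op_queue, pg_locks, process):
    """Run every op whose pg lock is free; set the rest aside instead of
    blocking the whole thread on one busy pg."""
    deferred = deque()
    while op_queue:
        pg, op = op_queue.popleft()
        lock = pg_locks.setdefault(pg, threading.Lock())
        if lock.acquire(blocking=False):      # never sleep on a held pg lock
            try:
                process(pg, op)
            finally:
                lock.release()
        else:
            deferred.append((pg, op))         # pg busy: requeue, keep serving others
    return deferred

# Demo: pg 10.c91 is busy, as if a deep-scrub or long omap read holds its lock.
pg_locks = {"10.c91": threading.Lock()}
pg_locks["10.c91"].acquire()

done = []
ops = deque([("10.c91", "omap-get-keys"), ("10.aaa", "subop-1"), ("10.bbb", "subop-2")])
left = drain(ops, pg_locks, lambda pg, op: done.append(op))

print(done)        # ['subop-1', 'subop-2'] -- other pgs were not blocked
print(list(left))  # [('10.c91', 'omap-get-keys')] -- deferred, not a head-of-line blocker
```

A real implementation would of course need fairness and wakeup logic for the deferred ops, which is roughly what the sharded work queue in Giant provides per shard.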
[ceph-users] Some long running ops may lock osd
Hi all, especially devs,

We have recently pinpointed one of the causes of slow requests in our cluster. It seems deep-scrubs on pg's that contain the index file for a large radosgw bucket lock the osds. Increasing op threads and/or disk threads helps a little bit, but we need to increase them beyond reason in order to completely get rid of the problem.

A somewhat similar (and more severe) version of the issue occurs when we call listomapkeys for the index file, and since the logs for deep-scrubbing were much harder to read, this inspection is based on listomapkeys. In this example osd.121 is the primary of pg 10.c91, which contains the file .dir.5926.3 in the .rgw.buckets pool. The OSD has 2 op threads. The bucket contains ~500k objects. A standard listomapkeys call takes about 3 seconds:

  $ time rados -p .rgw.buckets listomapkeys .dir.5926.3 > /dev/null
  real  0m2.983s
  user  0m0.760s
  sys   0m0.148s

In order to lock the osd we request 2 of them simultaneously, with something like:

  rados -p .rgw.buckets listomapkeys .dir.5926.3 > /dev/null &
  sleep 1
  rados -p .rgw.buckets listomapkeys .dir.5926.3 > /dev/null &

With 'debug_osd=30', the logs show the flow like:

At t0 some thread enqueue_op's my omap-get-keys request. Op-Thread A locks pg 10.c91, dequeue_op's it and starts reading ~500k keys. Op-Thread B responds to several other requests during that 1 second sleep. They're generally extremely fast subops on other pgs.

At t1 (about a second later) my second omap-get-keys request gets enqueue_op'ed. But it does not start, probably because of the lock held by Thread A. After that point other threads enqueue_op other requests on other pgs too, but none of them starts processing, at which point I consider the osd locked.

At t2 (about another second later) my first omap-get-keys request is finished. Op-Thread B locks pg 10.c91, dequeue_op's my second request and starts reading ~500k keys again. Op-Thread A continues to process the requests enqueued in t1-t2. It seems Op-Thread B was waiting on the lock held by Op-Thread A while it could have processed other requests for other pg's just fine.

My guess is a somewhat larger version of this scenario happens in deep-scrubbing, like on the pg containing the index for a bucket of 20M objects. A disk/op thread starts reading through the omap, which will take, say, 60 seconds. During the first seconds other requests for other pgs pass just fine. But in 60 seconds there are bound to be other requests for the same pg, especially since it holds the index file. Each of these requests locks another disk/op thread, to the point where there are no free threads left to process any requests for any pg. Causing slow requests.

So first of all, thanks if you made it here, and sorry for the involved mail; I'm exploring the problem as I go. Now, is that deep-scrubbing situation I tried to theorize even possible? If not, can you point us to where to look further? We are currently running 0.72.2 and know about the newer ioprio settings in Firefly and such. We are planning to upgrade in a few weeks, but I don't think those options will help us in any way. Am I correct? Are there any other improvements that we are not aware of?

Regards,

--
erdem agaoglu
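The t0/t1/t2 flow above can be reproduced outside Ceph with a toy model. This is a sketch, not Ceph code, and the timings and names are invented: two workers pull ops from one shared queue, each op must take its pg's lock before running, and a fast subop for an unrelated pg ends up waiting behind the second omap read because both workers are occupied with pg 10.c91.

```python
import queue
import threading
import time

pg_locks = {"10.c91": threading.Lock(), "10.aaa": threading.Lock()}
op_queue = queue.Queue()
completed = []

def op_thread():
    while True:
        op = op_queue.get()
        if op is None:             # shutdown marker
            return
        pg, seconds, label = op
        with pg_locks[pg]:         # like dequeue_op: take the pg lock first
            time.sleep(seconds)    # stand-in for reading ~500k omap keys
            completed.append(label)

workers = [threading.Thread(target=op_thread) for _ in range(2)]
for w in workers:
    w.start()

op_queue.put(("10.c91", 0.5, "omap-get-keys #1"))  # one worker starts the long read
time.sleep(0.1)
op_queue.put(("10.c91", 0.5, "omap-get-keys #2"))  # second worker blocks on the pg lock
op_queue.put(("10.aaa", 0.01, "fast subop"))       # free pg, but no free worker left

for _ in workers:
    op_queue.put(None)
for w in workers:
    w.join()

print(completed)  # the fast subop waited ~0.4s despite targeting an unlocked pg
```

With two workers and two same-pg ops in flight, the whole "osd" stalls for everything else, which is the small-scale version of the deep-scrub scenario.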
Re: [ceph-users] Some long running ops may lock osd
On Mon, Mar 2, 2015 at 7:56 AM, Erdem Agaoglu erdem.agao...@gmail.com wrote:
[...]

This is all basically correct; it's one of the reasons you don't want to let individual buckets get too large. That said, I'm a little confused about why you're running listomapkeys that way. RGW throttles itself by getting only a certain number of entries at a time (1000?) and any system you're also building should do the same. That would reduce the frequency of any issues, and I *think* that scrubbing has some mitigating factors to help (although maybe not; it's been a while since I looked at any of that stuff). Although I just realized that my vague memory of deep scrubbing working better might be based on improvements that only got in for firefly... not sure.

-Greg
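Greg's throttling point can be shown with a toy pager. This is a sketch over a plain dict standing in for the index object's omap, not real librados code; in a real client you would pass a start_after cursor to repeated omap-get-keys calls of ~1000 entries. The effect is that each page is a short op, so the pg lock is taken and released per page rather than pinned for one multi-second read.

```python
def paged_omap_keys(omap, page_size=1000):
    """Yield all keys in pages of page_size, resuming after the last key
    seen, instead of one listomapkeys over the full key space."""
    start_after = ""
    while True:
        # Each iteration models one short omap-get-keys op; between pages
        # the pg lock would be released, letting other requests run.
        page = sorted(k for k in omap if k > start_after)[:page_size]
        if not page:
            return
        yield from page
        start_after = page[-1]

# 2500 "index entries" fetched as three short ops instead of one long one.
index = {"obj-%07d" % i: b"" for i in range(2500)}
keys = list(paged_omap_keys(index, page_size=1000))
print(len(keys))  # 2500
```

The listing is still X reads of the same pg, but each read is short enough that subops for other pgs interleave between pages instead of queueing behind a single long lock hold.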
Re: [ceph-users] Some long running ops may lock osd
Hi Gregory,

We are not using listomapkeys that way, or in any way to be precise. I used it here just to reproduce the behavior/issue. What I am really interested in is whether scrubbing-deep actually mitigates the problem and/or whether there is something that can be further improved. Or I guess we should go upgrade now and hope for the best :)

On Mon, Mar 2, 2015 at 8:10 PM, Gregory Farnum g...@gregs42.com wrote:
[...]

--
erdem agaoglu