Re: [ceph-users] osd_snap_trim_sleep keeps locks PG during sleep?

Samuel Just Thu, 09 Feb 2017 11:22:20 -0800

Ok, https://github.com/athanatos/ceph/tree/wip-snap-trim-sleep (based on
master) passed a rados suite.  It adds a configurable limit to the number
of pgs which can be trimming on any OSD (default: 2).  PGs trimming will be
in snaptrim state, PGs waiting to trim will be in snaptrim_wait state.  I
suspect this'll be adequate to throttle the amount of trimming.  If not, I
can try to add an explicit limit to the rate at which the work items
trickle into the queue.  Can someone test this branch?   Tester beware:
this has not merged into master yet and should only be run on a disposable
cluster.
-Sam
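
The throttle described above can be pictured with a small toy model. This is only a sketch of the idea, not code from the branch: the class name `SnapTrimThrottle`, its methods, the PG ids, and the FIFO promotion order are all assumptions made for illustration. Only the default limit of 2 and the two state names come from the message.

```python
from collections import deque


class SnapTrimThrottle:
    """Toy model of a per-OSD cap on how many PGs may trim at once."""

    def __init__(self, max_trimming=2):
        self.max_trimming = max_trimming  # default of 2 taken from the message
        self.trimming = set()             # PGs in "snaptrim"
        self.waiting = deque()            # PGs in "snaptrim_wait" (FIFO is assumed)

    def request_trim(self, pg):
        """Start trimming if a slot is free, otherwise queue the PG."""
        if pg not in self.trimming and pg not in self.waiting:
            if len(self.trimming) < self.max_trimming:
                self.trimming.add(pg)
            else:
                self.waiting.append(pg)
        return self.state(pg)

    def finish_trim(self, pg):
        """Release the PG's slot and promote waiters into free slots."""
        self.trimming.discard(pg)
        while self.waiting and len(self.trimming) < self.max_trimming:
            self.trimming.add(self.waiting.popleft())

    def state(self, pg):
        if pg in self.trimming:
            return "snaptrim"
        if pg in self.waiting:
            return "snaptrim_wait"
        return "idle"


throttle = SnapTrimThrottle()
# Hypothetical PG ids; the third request exceeds the limit and has to wait.
states = [throttle.request_trim(pg) for pg in ("1.a", "1.b", "1.c")]
throttle.finish_trim("1.a")  # frees a slot, so "1.c" starts trimming
```

With a limit of 2, the third PG stays in snaptrim_wait until one of the first two finishes.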


On Tue, Feb 7, 2017 at 1:16 PM, Nick Fisk <n...@fisk.me.uk> wrote:

> Yeah, it’s probably just the fact that they have more PGs, so they will
> hold more data and thus serve more IO. As they have a fixed IO limit, they
> will always hit the limit first and become the bottleneck.
>
>
>
> The main problem with reducing the filestore queue is that I believe you
> will start to lose the benefit of having IOs queued up on the disk, so
> that the scheduler can rearrange them and action them in the most efficient
> manner as the disk head moves across the platters. You might possibly see up
> to a 20% hit on performance, in exchange for more consistent client
> latency.
>
>
>
> *From:* Steve Taylor [mailto:steve.tay...@storagecraft.com]
> *Sent:* 07 February 2017 20:35
> *To:* n...@fisk.me.uk; ceph-users@lists.ceph.com
>
> *Subject:* RE: Re: [ceph-users] osd_snap_trim_sleep keeps locks PG during
> sleep?
>
>
>
> Thanks, Nick.
>
>
>
> One other data point that has come up is that nearly all of the blocked
> requests that are waiting on subops are waiting for OSDs with more PGs than
> the others. My test cluster has 184 OSDs, 177 of which are 3TB, with 7 4TB
> OSDs. The cluster is well balanced based on OSD capacity, so those 7 OSDs
> individually have 33% more PGs than the others and are causing almost all
> of the blocked requests. It appears that map updates are generally not
> blocking long enough to show up as blocked requests.
>
>
>
> I set the reweight on those 7 OSDs to 0.75 and things are backfilling now.
> I’ll test some more when the PG counts per OSD are more balanced and see
> what I get. I’ll also play with the filestore queue. I was telling some of
> my colleagues yesterday that this looked likely to be related to buffer
> bloat somewhere. I appreciate the suggestion.
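
The 33% figure and the choice of a 0.75 reweight follow from simple proportions. The sketch below assumes that an OSD's share of PGs scales linearly with its capacity multiplied by its reweight value, which is a simplification of how CRUSH actually places data; the function name `relative_pg_share` is made up for illustration.

```python
def relative_pg_share(size_tb, baseline_tb, reweight=1.0):
    """PG share of an OSD relative to a baseline OSD.

    Assumes placement is proportional to capacity * reweight (a simplification).
    """
    return size_tb * reweight / baseline_tb


before = relative_pg_share(4, 3)        # 4TB OSD vs 3TB OSD, no reweight
after = relative_pg_share(4, 3, 0.75)   # same OSD with reweight 0.75

print(f"before reweight: {before:.2f}x the PGs of a 3TB OSD")
print(f"after reweight:  {after:.2f}x the PGs of a 3TB OSD")
```

Under that assumption a 4TB OSD carries about 1.33 times the PGs of a 3TB OSD, and a reweight of 0.75 brings it back to parity.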
>
>
> ------------------------------
>
>
>
> *Steve* *Taylor* | Senior Software Engineer | StorageCraft Technology
> Corporation
> 380 Data Drive Suite 300 | Draper | Utah | 84020
> *Office:* 801.871.2799
> ------------------------------
>
> If you are not the intended recipient of this message or received it
> erroneously, please notify the sender and delete it, together with any
> attachments, and be advised that any dissemination or copying of this
> message is prohibited.
> ------------------------------
>
> *From:* Nick Fisk [mailto:n...@fisk.me.uk]
> *Sent:* Tuesday, February 7, 2017 10:25 AM
> *To:* Steve Taylor <steve.tay...@storagecraft.com>;
> ceph-users@lists.ceph.com
> *Subject:* RE: Re: [ceph-users] osd_snap_trim_sleep keeps locks PG during
> sleep?
>
>
>
> Hi Steve,
>
>
>
> From what I understand, the issue is not with the queueing in Ceph, which
> is correctly moving client IO to the front of the queue. The problem lies
> below what Ceph controls, i.e. the scheduler and disk layer in Linux. Once
> the IOs leave Ceph it’s a bit of a free-for-all, and the client IOs tend
> to get lost in large disk queues, surrounded by all the snap trim IOs.
>
>
>
> The workaround Sam is working on will limit the number of snap trims that
> are allowed to run, which I believe will have a similar effect to the sleep
> parameters in pre-jewel clusters, but without pausing the whole IO thread.
>
>
>
> Ultimately the solution requires Ceph to be able to control the queuing of
> IOs at the lower levels of the kernel. Whether this is via some sort of
> tagging per IO (currently CFQ is only per thread/process) or some other
> method, I don’t know. I was speaking to Sage and he thinks the easiest
> method might be to shrink the filestore queue so that you don’t get buffer
> bloat at the disk level. You should be able to test this out pretty easily
> now by changing the parameter; a queue of around 5-10 is probably about
> right for spinning disks. It’s a trade-off between peak throughput and
> queue latency, though.
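
A rough back-of-the-envelope sketch of the latency side of that trade-off follows. It assumes a client IO arrives behind a full queue that is drained strictly in order, and it assumes a service time of 8 ms per IO on a spinning disk; both are illustrative assumptions, and a real scheduler reorders requests. The queue depth of 50 is likewise a made-up comparison point, not a Ceph default.

```python
def worst_case_wait_ms(queue_depth, service_time_ms):
    """Time for a newly queued IO to reach the head of a full in-order queue."""
    return queue_depth * service_time_ms


SERVICE_TIME_MS = 8.0  # assumed per-IO service time for a spinning disk

for depth in (5, 10, 50):  # 5-10 from the message; 50 is a hypothetical larger queue
    wait = worst_case_wait_ms(depth, SERVICE_TIME_MS)
    print(f"queue depth {depth:>2}: up to {wait:.0f} ms queued behind other IOs")
```

The shorter the queue, the less time a client IO can spend stuck behind snap trim IOs, at the cost of giving the scheduler fewer requests to reorder.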
>
>
>
> Nick
>
>
>
> *From:* ceph-users [mailto:ceph-users-boun...@lists.ceph.com
> <ceph-users-boun...@lists.ceph.com>] *On Behalf Of *Steve Taylor
> *Sent:* 07 February 2017 17:01
> *To:* ceph-users@lists.ceph.com
> *Subject:* Re: [ceph-users] osd_snap_trim_sleep keeps locks PG during
> sleep?
>
>
>
> As I look at more of these stuck ops, it looks like more of them are
> actually waiting on subops than on osdmap updates, so maybe there is still
> some headway to be made with the weighted priority queue settings. I do see
> OSDs waiting for map updates all the time, but they aren’t blocking things
> as much as the subops are. Thoughts?
>
>
> *From:* Steve Taylor
> *Sent:* Tuesday, February 7, 2017 9:13 AM
> *To:* 'ceph-users@lists.ceph.com' <ceph-users@lists.ceph.com>
> *Subject:* Re: [ceph-users] osd_snap_trim_sleep keeps locks PG during
> sleep?
>
>
>
> Sorry, I lost the previous thread on this. I apologize for the resulting
> incomplete reply.
>
>
>
> The issue that we’re having with Jewel, as David Turner mentioned, is that
> we can’t seem to throttle snap trimming sufficiently to prevent it from
> blocking I/O requests. On further investigation, I encountered
> osd_op_pq_max_tokens_per_priority, which, if I understand correctly, can be
> used together with ‘osd_op_queue = wpq’ to govern how many queue positions
> are available to each kind of operation, based on cost. I’m testing with
> RBDs using 4MB objects, so in order to leave plenty of room in the weighted
> priority queue for client I/O, I set osd_op_pq_max_tokens_per_priority to
> 64MB and osd_snap_trim_cost to 32MB+1. I figured this should essentially
> reserve 32MB in the queue for client I/O operations, which are prioritized
> higher and therefore shouldn’t get blocked.
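
The arithmetic behind those two values can be sketched as follows. This models the token limit as one shared byte budget, which is the reading of the settings given above and not a confirmed description of how Ceph's queue accounts for cost. It also assumes each client op costs one full 4MB object and that MB means 1024 * 1024 bytes.

```python
MB = 1024 * 1024  # assumed meaning of "MB" in the message

max_tokens = 64 * MB           # osd_op_pq_max_tokens_per_priority
snap_trim_cost = 32 * MB + 1   # osd_snap_trim_cost
client_op_cost = 4 * MB        # assumed: one full 4MB RBD object per client op

# Snap trim ops that fit into the budget at the same time.
concurrent_trims = max_tokens // snap_trim_cost

# Budget left over for client I/O while that many trims are queued.
remaining = max_tokens - concurrent_trims * snap_trim_cost
client_ops_that_fit = remaining // client_op_cost

print(f"snap trims that fit at once: {concurrent_trims}")
print(f"bytes left for client I/O:   {remaining}")
print(f"4MB client ops that fit:     {client_ops_that_fit}")
```

The extra byte in 32MB+1 is what keeps a second snap trim from fitting. Under these assumptions it also leaves one byte less than 32MB for client ops.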
>
>
>
> I still see blocked I/O requests, and when I dump in-flight ops, they show
> ‘op must wait for map.’ I assume this means that what’s blocking the I/O
> requests at this point is all of the osdmap updates caused by snap
> trimming, and not the actual snap trimming itself starving the ops of op
> threads. Hammer is able to mitigate this with osd_snap_trim_sleep by
> directly throttling snap trimming and therefore causing less frequent
> osdmap updates, but there doesn’t seem to be a good way to accomplish the
> same thing with Jewel.
>
>
>
> First of all, am I understanding these settings correctly? If so, are
> there other settings that could potentially help here, or do we just need
> something like Sam already mentioned that can sort of reserve threads for
> client I/O requests? Even then it seems like we might have issues if we
> can’t also throttle snap trimming. We delete a LOT of RBD snapshots on a
> daily basis, which we recognize is an extreme use case. Just wondering if
> there’s something else to try or if we need to start working toward
> implementing something new ourselves to handle our use case better.
>
>
>
> _______________________________________________
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
