Yeah, I think you're probably right. The answer is probably to add an
explicit rate-limiting element to the way the snaptrim events are
scheduled.
-Sam
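For illustration only, here is a minimal sketch of what an explicit rate limit on trim submission could look like, using a token bucket. This is not Ceph code and not how the OSD schedules snaptrim events; the names TokenBucket and schedule_trims, and the rate and burst parameters, are hypothetical. The point is the difference from priority alone: a low-priority item still gets submitted whenever the queue is otherwise idle, while a rate limit caps how much work reaches the disk per second.

```python
import time


class TokenBucket:
    """Hypothetical rate limiter: work is admitted only while tokens remain."""

    def __init__(self, rate, burst, clock=time.monotonic):
        self.rate = float(rate)      # tokens added per second
        self.burst = float(burst)    # most tokens that can accumulate
        self.clock = clock           # injectable so the sketch can be tested
        self.tokens = float(burst)
        self.last = clock()

    def try_acquire(self, cost=1.0):
        """Take `cost` tokens if available; never blocks or sleeps."""
        now = self.clock()
        elapsed = now - self.last
        self.tokens = min(self.burst, self.tokens + elapsed * self.rate)
        self.last = now
        if cost <= self.tokens:
            self.tokens -= cost
            return True
        return False


def schedule_trims(pending, bucket):
    """Return the trims that may be submitted now; the rest stay queued."""
    submitted = []
    while pending and bucket.try_acquire():
        submitted.append(pending.pop(0))
    return submitted
```

Because try_acquire never sleeps, a caller that runs out of tokens simply leaves the remaining trims queued and moves on, so nothing is held (such as a PG lock) while waiting for the next refill.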
On Thu, Jan 19, 2017 at 1:34 PM, Nick Fisk <[email protected]> wrote:
> I will give those both a go and report back, but the more I think about
> this the less I'm convinced that it's going to help.
>
> I think the problem is a general IO imbalance, there is probably something
> like 100+ times more trimming IO than client IO and so even if client IO gets
> promoted to the front of the queue by Ceph, once it hits the Linux IO layer
> it's fighting for itself. I guess this approach works with scrubbing as each
> read IO has to wait to be read before the next one is submitted, so the queue
> can be managed on the OSD. With trimming, writes can buffer up below what the
> OSD controls.
>
> I don't know if the snap trimming goes nuts because the journals are acking
> each request and the spinning disks can't keep up, or if it's something else.
> Does WBThrottle get involved with snap trimming?
>
> But from an underlying disk perspective, there is definitely more than 2
> snaps per OSD at a time going on, even if the OSD itself is not processing
> more than 2 at a time. I think there either needs to be another knob so that
> Ceph can throttle back snaps, not just de-prioritise them, or there needs to
> be a whole new kernel interface where an application can priority tag
> individual IOs for CFQ to handle, instead of the current limitation of
> priority per thread. I realise this is probably very very hard or impossible.
> But it would allow Ceph to control IO queues right down to the disk.
>
>> -----Original Message-----
>> From: Samuel Just [mailto:[email protected]]
>> Sent: 19 January 2017 18:58
>> To: Nick Fisk <[email protected]>
>> Cc: Dan van der Ster <[email protected]>; ceph-users
>> <[email protected]>
>> Subject: Re: [ceph-users] osd_snap_trim_sleep keeps locks PG during sleep?
>>
>> Have you also tried setting osd_snap_trim_cost to be 16777216 (16x the
>> default value, equal to a 16MB IO) and
>> osd_pg_max_concurrent_snap_trims to 1 (from 2)?
>> -Sam
>>
>> On Thu, Jan 19, 2017 at 7:57 AM, Nick Fisk <[email protected]> wrote:
>> > Hi Sam,
>> >
>> > Thanks for the confirmation on both which thread the trimming happens in
>> > and for confirming my suspicion that sleeping is now a bad idea.
>> >
>> > The problem I see is that even with setting the priority for trimming down
>> > low, it still seems to completely swamp the cluster. The trims seem to get
>> > submitted in an async manner which seems to leave all my disks sitting at
>> > queue depths of 50+ for several minutes until the snapshot is removed,
>> > often also causing several OSDs to get marked out and start flapping. I'm
>> > using WPQ but haven't changed the cutoff variable yet as I know you are
>> > working on fixing a bug with that.
>> >
>> > Nick
>> >
>> >> -----Original Message-----
>> >> From: Samuel Just [mailto:[email protected]]
>> >> Sent: 19 January 2017 15:47
>> >> To: Dan van der Ster <[email protected]>
>> >> Cc: Nick Fisk <[email protected]>; ceph-users
>> >> <[email protected]>
>> >> Subject: Re: [ceph-users] osd_snap_trim_sleep keeps locks PG during sleep?
>> >>
>> >> Snaptrimming is now in the main op threadpool along with scrub,
>> >> recovery, and client IO. I don't think it's a good idea to use any of
>> >> the _sleep configs anymore -- the intention is that by setting the
>> >> priority low, they won't actually be scheduled much.
>> >> -Sam
>> >>
>> >> On Thu, Jan 19, 2017 at 5:40 AM, Dan van der Ster <[email protected]>
>> >> wrote:
>> >> > On Thu, Jan 19, 2017 at 1:28 PM, Nick Fisk <[email protected]> wrote:
>> >> >> Hi Dan,
>> >> >>
>> >> >> I carried out some more testing after doubling the op threads, it
>> >> >> may have had a small benefit as potentially some threads are
>> >> >> available, but latency still sits more or less around the
>> >> >> configured snap sleep time.
>> >> >> Even more threads might help, but I suspect you are just
>> >> >> lowering the chance of IOs that are stuck behind the sleep,
>> >> >> rather than actually solving the problem.
>> >> >>
>> >> >> I'm guessing when the snap trimming was in the disk thread, you
>> >> >> wouldn't have noticed these sleeps, but now it's in the op thread
>> >> >> it will just sit there holding up all IO and be a lot more
>> >> >> noticeable. It might be that this option shouldn't be used with
>> >> >> Jewel+?
>> >> >
>> >> > That's a good thought -- so we need confirmation which thread is
>> >> > doing the snap trimming. I honestly can't figure it out from the
>> >> > code -- hopefully a dev could explain how it works.
>> >> >
>> >> > Otherwise, I don't have much practical experience with snap
>> >> > trimming in jewel yet -- our RBD cluster is still running 0.94.9.
>> >> >
>> >> > Cheers, Dan
>> >> >
>> >> >
>> >> >>
>> >> >>> -----Original Message-----
>> >> >>> From: ceph-users [mailto:[email protected]] On
>> >> >>> Behalf Of Nick Fisk
>> >> >>> Sent: 13 January 2017 20:38
>> >> >>> To: 'Dan van der Ster' <[email protected]>
>> >> >>> Cc: 'ceph-users' <[email protected]>
>> >> >>> Subject: Re: [ceph-users] osd_snap_trim_sleep keeps locks PG during
>> >> >>> sleep?
>> >> >>>
>> >> >>> We're on Jewel and you're right, I'm pretty sure the snap stuff is
>> >> >>> also now handled in the op thread.
>> >> >>>
>> >> >>> The dump historic ops socket command showed a 10s delay at the
>> >> >>> "Reached PG" stage, from Greg's response [1], it would suggest
>> >> >>> that the OSD itself isn't blocking but the PG is currently
>> >> >>> sleeping whilst trimming. I think in the former case, it would
>> >> >>> have a high time on the "Started" part of the op? Anyway I will
>> >> >>> carry out some more testing with higher osd op threads and see if
>> >> >>> that makes any difference. Thanks for the suggestion.
>> >> >>>
>> >> >>> Nick
>> >> >>>
>> >> >>>
>> >> >>> [1]
>> >> >>> http://lists.ceph.com/pipermail/ceph-users-ceph.com/2016-March/008652.html
>> >> >>>
>> >> >>> > -----Original Message-----
>> >> >>> > From: Dan van der Ster [mailto:[email protected]]
>> >> >>> > Sent: 13 January 2017 10:28
>> >> >>> > To: Nick Fisk <[email protected]>
>> >> >>> > Cc: ceph-users <[email protected]>
>> >> >>> > Subject: Re: [ceph-users] osd_snap_trim_sleep keeps locks PG during
>> >> >>> > sleep?
>> >> >>> >
>> >> >>> > Hammer or jewel? I've forgotten which thread pool is handling
>> >> >>> > the snap trim nowadays -- is it the op thread yet? If so,
>> >> >>> > perhaps all the op threads are stuck sleeping? Just a wild
>> >> >>> > guess. (Maybe increasing # op threads would help?).
>> >> >>> >
>> >> >>> > -- Dan
>> >> >>> >
>> >> >>> >
>> >> >>> > On Thu, Jan 12, 2017 at 3:11 PM, Nick Fisk <[email protected]> wrote:
>> >> >>> > > Hi,
>> >> >>> > >
>> >> >>> > > I had been testing some higher values with the
>> >> >>> > > osd_snap_trim_sleep variable to try and reduce the impact of
>> >> >>> > > removing RBD snapshots on our cluster and I have come across
>> >> >>> > > what I believe to be a possible unintended consequence. The
>> >> >>> > > value of the sleep seems to keep the lock on the PG open so
>> >> >>> > > that no other IO can use the PG whilst the snap removal
>> >> >>> > > operation is sleeping.
>> >> >>> > >
>> >> >>> > > I had set the variable to 10s to completely minimise the
>> >> >>> > > impact as I had some multi TB snapshots to remove and noticed
>> >> >>> > > that suddenly all IO to the cluster had a latency of roughly
>> >> >>> > > 10s as well, all the dumped ops show waiting on PG for 10s as
>> >> >>> > > well.
>> >> >>> > >
>> >> >>> > > Is the osd_snap_trim_sleep variable only ever meant to be
>> >> >>> > > used up to say a max of 0.1s and this is a known side effect,
>> >> >>> > > or should the lock on the PG be removed so that normal IO can
>> >> >>> > > continue during the sleeps?
>> >> >>> > >
>> >> >>> > > Nick
>> >> >>> > >
>> >> >>> > > _______________________________________________
>> >> >>> > > ceph-users mailing list
>> >> >>> > > [email protected]
>> >> >>> > > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
