Yeah, I think you're probably right.  The answer is probably to add an
explicit rate-limiting element to the way the snaptrim events are
scheduled.
-Sam
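
For illustration only, here is a minimal sketch of what an explicit rate limit on trim submission could look like, written as a token bucket. It is not Ceph code: TokenBucket, schedule_trims, and the rate/capacity parameters are hypothetical names made up for this example. The point is that when the budget is exhausted the scheduler defers the trim instead of sleeping, so nothing is held while waiting.

```python
import time


class TokenBucket:
    """Hypothetical limiter: about `rate` events/second, bursts up to `capacity`."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = float(rate)
        self.capacity = float(capacity)
        self.clock = clock              # injectable so the limiter can be tested
        self.tokens = float(capacity)   # start with a full bucket
        self.last = clock()

    def try_acquire(self, cost=1.0):
        """Refill based on elapsed time, then take `cost` tokens if available."""
        now = self.clock()
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False                    # caller defers; it never blocks here


def schedule_trims(pending, bucket):
    """Submit pending trims while the bucket allows; return (submitted, deferred)."""
    submitted = []
    while pending and bucket.try_acquire():
        submitted.append(pending.pop(0))
    return submitted, pending
```

With rate=2 and capacity=2, at most two trims are handed on immediately and the rest stay queued until tokens refill, which bounds how much trim IO can pile up below the scheduler.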

On Thu, Jan 19, 2017 at 1:34 PM, Nick Fisk <[email protected]> wrote:
> I will give those both a go and report back, but the more I think about 
> this the less I'm convinced that it's going to help.
>
> I think the problem is a general IO imbalance: there is probably something 
> like 100+ times more trimming IO than client IO, so even if client IO gets 
> promoted to the front of the queue by Ceph, once it hits the Linux IO layer 
> it's fighting for itself. I guess this approach works with scrubbing because 
> each read IO has to complete before the next one is submitted, so the queue 
> can be managed on the OSD. With trimming, writes can buffer up below what the 
> OSD controls.
>
> I don't know if the snap trimming goes nuts because the journals are acking 
> each request and the spinning disks can't keep up, or if it's something else. 
> Does WBThrottle get involved with snap trimming?
>
> But from an underlying disk perspective, there are definitely more than 2 
> snaps per OSD going on at a time, even if the OSD itself is not processing 
> more than 2 at a time. I think there either needs to be another knob so that 
> Ceph can throttle back snaps, not just de-prioritise them, or there needs to 
> be a whole new kernel interface where an application can priority-tag 
> individual IOs for CFQ to handle, instead of the current limitation of 
> priority per thread. I realise this is probably very, very hard or 
> impossible, but it would allow Ceph to control IO queues right down to the disk.
>
>> -----Original Message-----
>> From: Samuel Just [mailto:[email protected]]
>> Sent: 19 January 2017 18:58
>> To: Nick Fisk <[email protected]>
>> Cc: Dan van der Ster <[email protected]>; ceph-users 
>> <[email protected]>
>> Subject: Re: [ceph-users] osd_snap_trim_sleep keeps locks PG during sleep?
>>
>> Have you also tried setting osd_snap_trim_cost to be 16777216 (16x the 
>> default value, equal to a 16MB IO) and
>> osd_pg_max_concurrent_snap_trims to 1 (from 2)?
>> -Sam
>>
>> On Thu, Jan 19, 2017 at 7:57 AM, Nick Fisk <[email protected]> wrote:
>> > Hi Sam,
>> >
>> > Thanks for confirming both which thread the trimming happens in and my 
>> > suspicion that sleeping is now a bad idea.
>> >
>> > The problem I see is that even with the priority for trimming set low, 
>> > it still seems to completely swamp the cluster. The trims seem to be 
>> > submitted asynchronously, which leaves all my disks sitting at queue 
>> > depths of 50+ for several minutes until the snapshot is removed, often 
>> > also causing several OSDs to get marked out and start flapping. I'm using 
>> > WPQ but haven't changed the cutoff variable yet, as I know you are working 
>> > on fixing a bug with that.
>> >
>> > Nick
>> >
>> >> -----Original Message-----
>> >> From: Samuel Just [mailto:[email protected]]
>> >> Sent: 19 January 2017 15:47
>> >> To: Dan van der Ster <[email protected]>
>> >> Cc: Nick Fisk <[email protected]>; ceph-users
>> >> <[email protected]>
>> >> Subject: Re: [ceph-users] osd_snap_trim_sleep keeps locks PG during sleep?
>> >>
>> >> Snaptrimming is now in the main op threadpool along with scrub,
>> >> recovery, and client IO.  I don't think it's a good idea to use any of 
>> >> the _sleep configs anymore -- the intention is that by setting the
>> >> priority low, they won't actually be scheduled much.
>> >> -Sam
>> >>
>> >> On Thu, Jan 19, 2017 at 5:40 AM, Dan van der Ster <[email protected]> 
>> >> wrote:
>> >> > On Thu, Jan 19, 2017 at 1:28 PM, Nick Fisk <[email protected]> wrote:
>> >> >> Hi Dan,
>> >> >>
>> >> >> I carried out some more testing after doubling the op threads. It
>> >> >> may have had a small benefit, as potentially some threads are
>> >> >> available, but latency still sits more or less around the
>> >> >> configured snap sleep time. Even more threads might help, but I
>> >> >> suspect you are just lowering the chance of IOs getting stuck
>> >> >> behind the sleep, rather than actually solving the problem.
>> >> >>
>> >> >> I'm guessing when the snap trimming was in the disk thread, you
>> >> >> wouldn't have noticed these sleeps, but now it's in the op thread
>> >> >> it will just sit there holding up all IO and be a lot more
>> >> >> noticeable. It might be that this option shouldn't be used with
>> >> >> Jewel+?
>> >> >
>> >> > That's a good thought -- so we need confirmation which thread is
>> >> > doing the snap trimming. I honestly can't figure it out from the
>> >> > code -- hopefully a dev could explain how it works.
>> >> >
>> >> > Otherwise, I don't have much practical experience with snap
>> >> > trimming in jewel yet -- our RBD cluster is still running 0.94.9.
>> >> >
>> >> > Cheers, Dan
>> >> >
>> >> >
>> >> >>
>> >> >>> -----Original Message-----
>> >> >>> From: ceph-users [mailto:[email protected]] On
>> >> >>> Behalf Of Nick Fisk
>> >> >>> Sent: 13 January 2017 20:38
>> >> >>> To: 'Dan van der Ster' <[email protected]>
>> >> >>> Cc: 'ceph-users' <[email protected]>
>> >> >>> Subject: Re: [ceph-users] osd_snap_trim_sleep keeps locks PG during 
>> >> >>> sleep?
>> >> >>>
>> >> >>> We're on Jewel and you're right, I'm pretty sure the snap stuff is also 
>> >> >>> now handled in the op thread.
>> >> >>>
>> >> >>> The dump historic ops socket command showed a 10s delay at the
>> >> >>> "Reached PG" stage. From Greg's response [1], that would suggest
>> >> >>> that the OSD itself isn't blocking but the PG is, as it's currently
>> >> >>> sleeping whilst trimming. I think in the former case, it would
>> >> >>> have a high time on the "Started" part of the op? Anyway, I will
>> >> >>> carry out some more testing with higher osd op threads and see if
>> >> >>> that makes any difference. Thanks for the suggestion.
>> >> >>>
>> >> >>> Nick
>> >> >>>
>> >> >>>
>> >> >>> [1]
>> >> >>> http://lists.ceph.com/pipermail/ceph-users-ceph.com/2016-March/008652.html
>> >> >>>
>> >> >>> > -----Original Message-----
>> >> >>> > From: Dan van der Ster [mailto:[email protected]]
>> >> >>> > Sent: 13 January 2017 10:28
>> >> >>> > To: Nick Fisk <[email protected]>
>> >> >>> > Cc: ceph-users <[email protected]>
>> >> >>> > Subject: Re: [ceph-users] osd_snap_trim_sleep keeps locks PG during 
>> >> >>> > sleep?
>> >> >>> >
>> >> >>> > Hammer or jewel? I've forgotten which thread pool is handling
>> >> >>> > the snap trim nowadays -- is it the op thread yet? If so,
>> >> >>> > perhaps all the op threads are stuck sleeping? Just a wild
>> >> >>> > guess. (Maybe increasing # op threads would help?)
>> >> >>> >
>> >> >>> > -- Dan
>> >> >>> >
>> >> >>> >
>> >> >>> > On Thu, Jan 12, 2017 at 3:11 PM, Nick Fisk <[email protected]> wrote:
>> >> >>> > > Hi,
>> >> >>> > >
>> >> >>> > > I had been testing some higher values with the
>> >> >>> > > osd_snap_trim_sleep variable to try and reduce the impact of
>> >> >>> > > removing RBD snapshots on our cluster and I have come across
>> >> >>> > > what I believe to be a possible unintended consequence. The
>> >> >>> > > sleep seems to keep the lock on the PG held, so that no other
>> >> >>> > > IO can use the PG whilst the snap removal operation is sleeping.
>> >> >>> > >
>> >> >>> > > I had set the variable to 10s to completely minimise the
>> >> >>> > > impact, as I had some multi-TB snapshots to remove, and noticed
>> >> >>> > > that suddenly all IO to the cluster had a latency of roughly
>> >> >>> > > 10s as well. All the dumped ops show waiting on PG for 10s too.
>> >> >>> > >
>> >> >>> > > Is the osd_snap_trim_sleep variable only ever meant to be
>> >> >>> > > used up to, say, a max of 0.1s, and this is a known side effect,
>> >> >>> > > or should the lock on the PG be released so that normal IO can
>> >> >>> > > continue during the sleeps?
>> >> >>> > >
>> >> >>> > > Nick
>> >> >>> > >
>> >> >>> > > _______________________________________________
>> >> >>> > > ceph-users mailing list
>> >> >>> > > [email protected]
>> >> >>> > > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>> >> >>>
>> >> >>
>> >
>