On Thu, Oct 10, 2013 at 1:20 PM, Ned Bass <[email protected]> wrote:

> On Thu, Oct 10, 2013 at 11:01:49AM -0700, Matthew Ahrens wrote:
> >
> >     On line 1035 we return if the tx has already been delayed long
> >     enough, which would normally prevent wakeup from getting a time in
> >     the past on line 1043.  However, line 1037 may lower the value of
> >     min_tx_time enough to cause a past wakeup time after all.
> >
> >
> > Agreed, although this should be unusual, given that zfs_delay_max_ns is
> > 0.1 seconds.
>
> In my testing I observe that dmu_tx_delay() _always_ returns here
> under normal conditions:
>
>     1034         if (now > tx->tx_start + min_tx_time)
>     1035                 return;
>
> I haven't yet explained why this is the case.  It must either be that
> dirty is staying very close to dirty_delay_min_bytes, or some overhead
> higher up in the call path has already incurred enough delay.
>

The delay would only kick in when the application is writing faster than
the disk can keep up.  If you just have a few spinning disks, you can
probably hit it with "dd if=/dev/zero of=/pool/bigfile bs=1024k".  Might
need several instances of that (to different files) if you have lots of
fast disks or few slow CPUs.


>
> The only exception to this is when I dynamically tune zfs_dirty_data_max
> downward by a large amount.  In that case dirty is close enough to the
> new max that min_tx_time is initially around 3s.  But then when it is
> truncated to 100ms, wakeup gets a timestamp in the past.  And because I
> implemented the kernel delay using msleep(), which takes a relative
> time, my system hangs.
>
> This raises another question I've been meaning to ask: why was the delay
> implemented using cv_timedwait_hires()?  I'm not familiar with timer
> APIs in Illumos, but it seems like this could have been done in a simpler
> way that would be easier to emulate on other platforms, such as a simple
> delay or sleep call.  Yes, we could rewrite it for Linux, but I'd prefer
> to minimize differences in core code like this.
>

What routine would you prefer be used?  There's no msleep() in illumos.
Perhaps we should create a "sleep until this time" interface to wrap the
(admittedly ugly) mutex + CV + timedwait().  Something to consider for the
common codebase.


>
> The primary portability issues for Linux are
>
> - We don't have the equivalent of curthread->t_delay_lock, so we'd have
>   to add a new lock somewhere to implement this
>
> - I believe nanosecond-resolution timers in Linux just busy-wait, so it
>   would be inefficient, and we don't really need such high resolution in
>   this context, right?
>

That's right.  In fact, you'll see that zfs_delay_resolution_ns is 100
microseconds.


>
> >     So, shouldn't line 1037 be moved above line 1034?
> >
> >
> > That seems like a good idea anyway, but it is not sufficient to prevent
> > passing negative times to nanosleep().  Even if we ensure that
> > wakeup > now, delta is calculated relative to gethrtime(), rather than
> > "now".  We should probably change that too.
>
> As long as we're discussing this patch, I think a correction to this
> comment is needed:
>
> /*
>  * This controls how quickly the delay approaches infinity.
>  * Larger values cause it to delay less for a given amount of dirty data.
>  * Therefore larger values will cause there to be more dirty data for a
>  * given throughput.
> ..
>  */
> unsigned long zfs_delay_scale = 1000 * 1000 * 1000 / 2000;
>
> Isn't that backwards?  I think the words "less" and "more" need to swap
> positions.


Yes, good catch!

--matt
_______________________________________________
developer mailing list
[email protected]
http://lists.open-zfs.org/mailman/listinfo/developer
