Re: [Developer] dmu_tx_delay() may calculate wakeup time in the past

Ned Bass Thu, 10 Oct 2013 13:22:23 -0700

On Thu, Oct 10, 2013 at 11:01:49AM -0700, Matthew Ahrens wrote:
> 
>     One line 1035 we return if the tx has already been delayed long enough,
>     which would normally prevent wakeup from getting a time in the past on
>     line 1043.  However, line 1037 may lower the value of min_tx_time enough
>     to cause a past wakeup time after all.  
> 
> 
> Agreed, although this should be unusual, given that zfs_delay_max_ns is
> 0.1seconds.


In my testing I observe that dmu_tx_delay() _always_ returns here
under normal conditions:

    1034         if (now > tx->tx_start + min_tx_time)
    1035                 return;

I haven't yet explained why this is the case.  It must either be that
dirty is staying very close to dirty_delay_min_bytes, or some overhead
higher up in the call path has already incurred enough delay.

The only exception to this is when I dynamically tune zfs_dirty_data_max
downward by a large amount.  In that case dirty is close enough to the
new max that min_tx_time is initially around 3s.  But then when it is
truncated to 100ms, wakeup gets a timestamp in the past.  And because I
implemented the kernel delay using msleep(), which takes a relative
time, my system hangs.

This raises another question I've been meaning to ask: why was the delay
implemented using cv_timedwait_hires()?  I'm not familiar with timer
APIs in Illumos, but it seems like this could have be done in a simpler
way that would be easier to emulate on other platforms, such as a simple
delay or sleep call.  Yes, we could rewrite it for Linux, but I'd prefer
to minimize differences in core code like this.

The primary portability issues for Linux are

- We don't have the equivalent of curthread->t_delay_lock, so we'd have
  to add a new lock somewhere to implement this

- I believe nanosecond-resolution timers in Linux just busy-wait, so it
  would be inefficient, and we don't really need such high resolution in
  this context, right?

>     So, shouldn't line 1037 be moved above line 1034?
> 
> 
> That seems like a good idea anyway, but it is not sufficient to prevent 
> passing
> negative times to nanosleep(). Even if we ensure that wakeup > now,  delta is
> calculated relative to gethrtime(), rather than "now".  We should probably
> change that too.

As long as we're discussing this patch, I think a correction to this
comment is needed:

/*
 * This controls how quickly the delay approaches infinity.
 * Larger values cause it to delay less for a given amount of dirty data.
 * Therefore larger values will cause there to be more dirty data for a
 * given throughput.
...
 */
unsigned long zfs_delay_scale = 1000 * 1000 * 1000 / 2000;

Isn't that backwards?  I think the words "less" and "more" need to swap
positions.

Thanks,
Ned
_______________________________________________
developer mailing list
[email protected]
http://lists.open-zfs.org/mailman/listinfo/developer

Re: [Developer] dmu_tx_delay() may calculate wakeup time in the past

Reply via email to