On Thu, Oct 10, 2013 at 11:01:49AM -0700, Matthew Ahrens wrote:
>
> One line 1035 we return if the tx has already been delayed long enough,
> which would normally prevent wakeup from getting a time in the past on
> line 1043. However, line 1037 may lower the value of min_tx_time enough
> to cause a past wakeup time after all.
>
>
> Agreed, although this should be unusual, given that zfs_delay_max_ns is
> 0.1seconds.
In my testing I observe that dmu_tx_delay() _always_ returns here
under normal conditions:
1034 if (now > tx->tx_start + min_tx_time)
1035 return;
I haven't yet explained why this is the case. It must either be that
dirty is staying very close to dirty_delay_min_bytes, or some overhead
higher up in the call path has already incurred enough delay.
The only exception to this is when I dynamically tune zfs_dirty_data_max
downward by a large amount. In that case dirty is close enough to the
new max that min_tx_time is initially around 3s. But then when it is
truncated to 100ms, wakeup gets a timestamp in the past. And because I
implemented the kernel delay using msleep(), which takes a relative
time, my system hangs.
This raises another question I've been meaning to ask: why was the delay
implemented using cv_timedwait_hires()? I'm not familiar with timer
APIs in Illumos, but it seems like this could have be done in a simpler
way that would be easier to emulate on other platforms, such as a simple
delay or sleep call. Yes, we could rewrite it for Linux, but I'd prefer
to minimize differences in core code like this.
The primary portability issues for Linux are
- We don't have the equivalent of curthread->t_delay_lock, so we'd have
to add a new lock somewhere to implement this
- I believe nanosecond-resolution timers in Linux just busy-wait, so it
would be inefficient, and we don't really need such high resolution in
this context, right?
> So, shouldn't line 1037 be moved above line 1034?
>
>
> That seems like a good idea anyway, but it is not sufficient to prevent
> passing
> negative times to nanosleep(). Even if we ensure that wakeup > now, delta is
> calculated relative to gethrtime(), rather than "now". We should probably
> change that too.
As long as we're discussing this patch, I think a correction to this
comment is needed:
/*
* This controls how quickly the delay approaches infinity.
* Larger values cause it to delay less for a given amount of dirty data.
* Therefore larger values will cause there to be more dirty data for a
* given throughput.
...
*/
unsigned long zfs_delay_scale = 1000 * 1000 * 1000 / 2000;
Isn't that backwards? I think the words "less" and "more" need to swap
positions.
Thanks,
Ned
_______________________________________________
developer mailing list
[email protected]
http://lists.open-zfs.org/mailman/listinfo/developer