On Thu, Oct 10, 2013 at 1:20 PM, Ned Bass <[email protected]> wrote:

> On Thu, Oct 10, 2013 at 11:01:49AM -0700, Matthew Ahrens wrote:
> > > On line 1035 we return if the tx has already been delayed long
> > > enough, which would normally prevent wakeup from getting a time in
> > > the past on line 1043.  However, line 1037 may lower the value of
> > > min_tx_time enough to cause a past wakeup time after all.
> >
> > Agreed, although this should be unusual, given that zfs_delay_max_ns
> > is 0.1 seconds.
>
> In my testing I observe that dmu_tx_delay() _always_ returns here under
> normal conditions:
>
>     1034         if (now > tx->tx_start + min_tx_time)
>     1035                 return;
>
> I haven't yet explained why this is the case.  It must either be that
> dirty is staying very close to dirty_delay_min_bytes, or some overhead
> higher up in the call path has already incurred enough delay.
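[Editor's note: a minimal standalone sketch of the ordering issue discussed above. It is not the actual dmu_tx_delay() code; the function names and constants are illustrative, and the clamp-to-zfs_delay_max_ns step stands in for line 1037.]

```c
#include <assert.h>
#include <stdint.h>

typedef int64_t hrtime_t;

#define ZFS_DELAY_MAX_NS	(100LL * 1000 * 1000)	/* 100ms cap */
#define MIN(a, b)		((a) < (b) ? (a) : (b))

/*
 * If the early-return check runs before min_tx_time is clamped to
 * zfs_delay_max_ns, a large pre-clamp delay can pass the check, and the
 * post-clamp wakeup (tx_start + min_tx_time) can land in the past.
 * Returns 0 when no delay is needed, else the wakeup time.
 */
static hrtime_t
wakeup_clamp_after_check(hrtime_t now, hrtime_t tx_start, hrtime_t min_tx_time)
{
	if (now > tx_start + min_tx_time)
		return (0);		/* already delayed long enough */
	min_tx_time = MIN(min_tx_time, ZFS_DELAY_MAX_NS);
	return (tx_start + min_tx_time);
}

/* Clamping first (moving line 1037 above line 1034) avoids that. */
static hrtime_t
wakeup_clamp_before_check(hrtime_t now, hrtime_t tx_start, hrtime_t min_tx_time)
{
	min_tx_time = MIN(min_tx_time, ZFS_DELAY_MAX_NS);
	if (now > tx_start + min_tx_time)
		return (0);
	return (tx_start + min_tx_time);
}
```

With tx_start = 0, now = 200ms, and a pre-clamp min_tx_time of 3s (the zfs_dirty_data_max retuning case below), the first variant produces a wakeup 100ms in the past, while the second correctly skips the delay.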
The delay would only kick in when the application is writing faster than
the disk can keep up.  If you just have a few spinning disks, you can
probably hit it with "dd if=/dev/zero of=/pool/bigfile bs=1024k".  You
might need several instances of that (to different files) if you have
lots of fast disks or few slow CPUs.

> The only exception to this is when I dynamically tune zfs_dirty_data_max
> downward by a large amount.  In that case dirty is close enough to the
> new max that min_tx_time is initially around 3s.  But then when it is
> truncated to 100ms, wakeup gets a timestamp in the past.  And because I
> implemented the kernel delay using msleep(), which takes a relative
> time, my system hangs.
>
> This raises another question I've been meaning to ask: why was the
> delay implemented using cv_timedwait_hires()?  I'm not familiar with
> timer APIs in illumos, but it seems like this could have been done in a
> simpler way that would be easier to emulate on other platforms, such as
> a simple delay or sleep call.  Yes, we could rewrite it for Linux, but
> I'd prefer to minimize differences in core code like this.

What routine would you prefer be used?  There's no msleep() in illumos.
Perhaps we should create a "sleep until this time" interface to wrap the
(admittedly ugly) mutex + CV + timedwait().  Something to consider for
the common codebase.

> The primary portability issues for Linux are
>
> - We don't have the equivalent of curthread->t_delay_lock, so we'd
>   have to add a new lock somewhere to implement this
>
> - I believe nanosecond-resolution timers in Linux just busy-wait, so
>   it would be inefficient, and we don't really need such high
>   resolution in this context, right?

That's right.  In fact, you'll see that zfs_delay_resolution_ns is 100
microseconds.

> > > So, shouldn't line 1037 be moved above line 1034?
> >
> > That seems like a good idea anyway, but it is not sufficient to
> > prevent passing negative times to nanosleep().
> > Even if we ensure that wakeup > now, delta is calculated relative to
> > gethrtime(), rather than "now".  We should probably change that too.
>
> As long as we're discussing this patch, I think a correction to this
> comment is needed:
>
> /*
>  * This controls how quickly the delay approaches infinity.
>  * Larger values cause it to delay less for a given amount of dirty data.
>  * Therefore larger values will cause there to be more dirty data for a
>  * given throughput.
>  ..
>  */
> unsigned long zfs_delay_scale = 1000 * 1000 * 1000 / 2000;
>
> Isn't that backwards?  I think the words "less" and "more" need to swap
> positions.

Yes, good catch!

--matt
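[Editor's note: a simplified sketch of the delay curve the zfs_delay_scale comment describes. The function and parameter names are illustrative, not the real dmu_tx_delay() code, but they show why the comment is backwards: the delay is proportional to the scale factor and grows toward infinity as dirty data approaches the maximum, so larger values delay *more* for a given amount of dirty data, and therefore cause *less* dirty data for a given throughput.]

```c
#include <assert.h>
#include <stdint.h>

/*
 * Simplified model: delay grows hyperbolically as dirty approaches
 * dirty_max, and scales linearly with the scale factor.  Below
 * dirty_min there is no delay at all.
 */
static uint64_t
tx_delay_ns(uint64_t scale, uint64_t dirty, uint64_t dirty_min,
    uint64_t dirty_max)
{
	if (dirty <= dirty_min)
		return (0);
	return (scale * (dirty - dirty_min) / (dirty_max - dirty));
}
```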
_______________________________________________
developer mailing list
[email protected]
http://lists.open-zfs.org/mailman/listinfo/developer
