On Mon, Oct 7, 2013 at 7:05 PM, Ned Bass <[email protected]> wrote: > Here's something I noticed while evaluating the "4045 zfs write throttle > & i/o scheduler performance work" patch on Linux. > > In dmu_tx_delay, the amount of time to delay a transaction is calculated > like this: > > 1031 now = gethrtime(); > 1032 min_tx_time = zfs_delay_scale * > 1033 (dirty - delay_min_bytes) / (zfs_dirty_data_max - dirty); > 1034 if (now > tx->tx_start + min_tx_time) > 1035 return; > 1036 > 1037 min_tx_time = MIN(min_tx_time, zfs_delay_max_ns); > 1038 > 1039 DTRACE_PROBE3(delay__mintime, dmu_tx_t *, tx, uint64_t, dirty, > 1040 uint64_t, min_tx_time); > 1041 > 1042 mutex_enter(&dp->dp_lock); > 1043 wakeup = MAX(tx->tx_start + min_tx_time, > 1044 dp->dp_last_wakeup + min_tx_time); > 1045 dp->dp_last_wakeup = wakeup; > 1046 mutex_exit(&dp->dp_lock); > 1047 > 1048 #ifdef _KERNEL > 1049 mutex_enter(&curthread->t_delay_lock); > 1050 while (cv_timedwait_hires(&curthread->t_delay_cv, > 1051 &curthread->t_delay_lock, wakeup, zfs_delay_resolution_ns, > 1052 CALLOUT_FLAG_ABSOLUTE | CALLOUT_FLAG_ROUNDUP) > 0) > 1053 continue; > 1054 mutex_exit(&curthread->t_delay_lock); > 1055 #else > 1056 hrtime_t delta = wakeup - gethrtime(); > 1057 struct timespec ts; > 1058 ts.tv_sec = delta / NANOSEC; > 1059 ts.tv_nsec = delta % NANOSEC; > 1060 (void) nanosleep(&ts, NULL); > 1061 #endif > > One line 1035 we return if the tx has already been delayed long enough, > which would normally prevent wakeup from getting a time in the past on > line 1043. However, line 1037 may lower the value of min_tx_time enough > to cause a past wakeup time after all.
Agreed, although this should be unusual, given that zfs_delay_max_ns is 0.1seconds. > This may not cause a problem in > the kernel code since it uses an absolute time. But the userspace code > calculates a relative delay time which would become negative in this > case, and may be interpreted as a really large positive value by > nanosleep(). > nanosleep() should return EINVAL if passed a negative delta. (Linux manpage says so, and though the illumos manpage doesn't document it, I checked the implementation and it does.) > > So, shouldn't line 1037 be moved above line 1034? > That seems like a good idea anyway, but it is not sufficient to prevent passing negative times to nanosleep(). Even if we ensure that wakeup > now, delta is calculated relative to gethrtime(), rather than "now". We should probably change that too. --matt
_______________________________________________ developer mailing list [email protected] http://lists.open-zfs.org/mailman/listinfo/developer
