Here's something I noticed while evaluating the "4045 zfs write throttle & i/o scheduler performance work" patch on Linux.
In dmu_tx_delay, the amount of time to delay a transaction is calculated like this: 1031 now = gethrtime(); 1032 min_tx_time = zfs_delay_scale * 1033 (dirty - delay_min_bytes) / (zfs_dirty_data_max - dirty); 1034 if (now > tx->tx_start + min_tx_time) 1035 return; 1036 1037 min_tx_time = MIN(min_tx_time, zfs_delay_max_ns); 1038 1039 DTRACE_PROBE3(delay__mintime, dmu_tx_t *, tx, uint64_t, dirty, 1040 uint64_t, min_tx_time); 1041 1042 mutex_enter(&dp->dp_lock); 1043 wakeup = MAX(tx->tx_start + min_tx_time, 1044 dp->dp_last_wakeup + min_tx_time); 1045 dp->dp_last_wakeup = wakeup; 1046 mutex_exit(&dp->dp_lock); 1047 1048 #ifdef _KERNEL 1049 mutex_enter(&curthread->t_delay_lock); 1050 while (cv_timedwait_hires(&curthread->t_delay_cv, 1051 &curthread->t_delay_lock, wakeup, zfs_delay_resolution_ns, 1052 CALLOUT_FLAG_ABSOLUTE | CALLOUT_FLAG_ROUNDUP) > 0) 1053 continue; 1054 mutex_exit(&curthread->t_delay_lock); 1055 #else 1056 hrtime_t delta = wakeup - gethrtime(); 1057 struct timespec ts; 1058 ts.tv_sec = delta / NANOSEC; 1059 ts.tv_nsec = delta % NANOSEC; 1060 (void) nanosleep(&ts, NULL); 1061 #endif One line 1035 we return if the tx has already been delayed long enough, which would normally prevent wakeup from getting a time in the past on line 1043. However, line 1037 may lower the value of min_tx_time enough to cause a past wakeup time after all. This may not cause a problem in the kernel code since it uses an absolute time. But the userspace code calculates a relative delay time which would become negative in this case, and may be interpreted as a really large positive value by nanosleep(). So, shouldn't line 1037 be moved above line 1034? Thanks, Ned _______________________________________________ developer mailing list [email protected] http://lists.open-zfs.org/mailman/listinfo/developer
