On Mon, Oct 7, 2013 at 7:05 PM, Ned Bass <[email protected]> wrote:

> Here's something I noticed while evaluating the "4045 zfs write throttle
> & i/o scheduler performance work" patch on Linux.
>
> In dmu_tx_delay, the amount of time to delay a transaction is calculated
> like this:
>
> 1031         now = gethrtime();
> 1032         min_tx_time = zfs_delay_scale *
> 1033             (dirty - delay_min_bytes) / (zfs_dirty_data_max - dirty);
> 1034         if (now > tx->tx_start + min_tx_time)
> 1035                 return;
> 1036
> 1037         min_tx_time = MIN(min_tx_time, zfs_delay_max_ns);
> 1038
> 1039         DTRACE_PROBE3(delay__mintime, dmu_tx_t *, tx, uint64_t, dirty,
> 1040             uint64_t, min_tx_time);
> 1041
> 1042         mutex_enter(&dp->dp_lock);
> 1043         wakeup = MAX(tx->tx_start + min_tx_time,
> 1044             dp->dp_last_wakeup + min_tx_time);
> 1045         dp->dp_last_wakeup = wakeup;
> 1046         mutex_exit(&dp->dp_lock);
> 1047
> 1048 #ifdef _KERNEL
> 1049         mutex_enter(&curthread->t_delay_lock);
> 1050         while (cv_timedwait_hires(&curthread->t_delay_cv,
> 1051             &curthread->t_delay_lock, wakeup, zfs_delay_resolution_ns,
> 1052             CALLOUT_FLAG_ABSOLUTE | CALLOUT_FLAG_ROUNDUP) > 0)
> 1053                 continue;
> 1054         mutex_exit(&curthread->t_delay_lock);
> 1055 #else
> 1056         hrtime_t delta = wakeup - gethrtime();
> 1057         struct timespec ts;
> 1058         ts.tv_sec = delta / NANOSEC;
> 1059         ts.tv_nsec = delta % NANOSEC;
> 1060         (void) nanosleep(&ts, NULL);
> 1061 #endif
>
> One line 1035 we return if the tx has already been delayed long enough,
> which would normally prevent wakeup from getting a time in the past on
> line 1043.  However, line 1037 may lower the value of min_tx_time enough
> to cause a past wakeup time after all.


Agreed, although this should be unusual, given that zfs_delay_max_ns is
0.1seconds.


> This may not cause a problem in
> the kernel code since it uses an absolute time.  But the userspace code
> calculates a relative delay time which would become negative in this
> case, and may be interpreted as a really large positive value by
> nanosleep().
>

nanosleep() should return EINVAL if passed a negative delta.  (Linux
manpage says so, and though the illumos manpage doesn't document it, I
checked the implementation and it does.)


>
> So, shouldn't line 1037 be moved above line 1034?
>

That seems like a good idea anyway, but it is not sufficient to prevent
passing negative times to nanosleep(). Even if we ensure that wakeup > now,
 delta is calculated relative to gethrtime(), rather than "now".  We should
probably change that too.

--matt
_______________________________________________
developer mailing list
[email protected]
http://lists.open-zfs.org/mailman/listinfo/developer

Reply via email to