On Wednesday, February 04, 2015 04:29:41 PM Konstantin Belousov wrote:
> On Tue, Feb 03, 2015 at 01:33:15PM -0800, Peter Wemm wrote:
> > Sometime in the Dec 10th through Jan 7th timeframe a timing bug has been
> > introduced to 11.x/head/-current. With HZ=1000 (the default for bare
> > metal, not for a vm); the clocks stop just after 24 days of uptime. This
> > means things like cron, sleep, timeouts etc stop working. TCP/IP won't
> > time out or retransmit, etc etc. It can get ugly.
> >
> > The problem is NOT in 10.x/-stable.
> >
> > We hit this in the freebsd.org cluster, the builds that we used are:
> > FreeBSD 11.0-CURRENT #0 r275684: Wed Dec 10 20:38:43 UTC 2014 - fine
> > FreeBSD 11.0-CURRENT #0 r276779: Wed Jan 7 18:47:09 UTC 2015 - broken
> >
> > If you are running -current in a situation where it'll accumulate uptime,
> > you may want to take precautions. A reboot prior to 24 days uptime (as
> > horrible a workaround as that is) will avoid it.
> >
> > Yes, this is being worked on.
>
> So the issue is reproducible in 3 minutes after boot with the following
> change in kern_clock.c:
> volatile int ticks = INT_MAX - (/*hz*/1000 * 3 * 60);
>
> It is fixed (in the proper meaning of the word, not like worked around,
> covered by paper) by the patch at the end of the mail.
>
> We already have a story trying to enable much less ambitious option
> -fno-strict-overflow, see r259045 and the revert in r259422. I do not
> see other way than try one more time. Too many places in kernel
> depend on the correctly wrapping 2-complement arithmetic, among others
> are callwheel and scheduler.
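For anyone following along, here is a minimal sketch of the arithmetic
behind the 24-day figure, plus a tick comparison that does not rely on
signed overflow. This is not the patch referred to above. The helper
names (seconds_until_wrap, tick_next, tick_diff, tick_before) are made
up for illustration, and it assumes a 32-bit tick counter and a compiler
that wraps on unsigned-to-signed conversion, as gcc and clang do.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Seconds of uptime before a signed 32-bit tick counter that starts at 0
 * reaches INT32_MAX. With hz = 1000 this is about 24.86 days.
 */
double
seconds_until_wrap(int hz)
{
    assert(hz > 0);
    return (double)INT32_MAX / hz;
}

/*
 * Hypothetical helper: advance a tick counter by one. The addition is
 * done on uint32_t, where wraparound is well defined in C. Converting
 * the result back to int32_t is implementation-defined rather than
 * undefined; gcc and clang wrap modulo 2^32.
 */
int32_t
tick_next(int32_t t)
{
    return (int32_t)((uint32_t)t + 1u);
}

/*
 * Hypothetical helper: signed distance a - b, computed with unsigned
 * subtraction so the optimizer cannot assume "no overflow happens".
 */
int32_t
tick_diff(int32_t a, int32_t b)
{
    return (int32_t)((uint32_t)a - (uint32_t)b);
}

/* Nonzero if tick value a is earlier than b, even across the wrap. */
int
tick_before(int32_t a, int32_t b)
{
    return tick_diff(a, b) < 0;
}
```

With these, tick_before(INT32_MAX, tick_next(INT32_MAX)) holds even
though the second value is numerically negative, which is the behaviour
the kernel's tick comparisons expect from wrapping arithmetic.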
Ugh. I believe I have a smoking gun that suggests that the clock-stop
problem is caused by the clang-3.5 import on Dec 31st.

Backstory:
http://blog.llvm.org/2011/05/what-every-c-programmer-should-know.html
http://www.airs.com/blog/archives/120

I suspect that what has happened is that clang's optimizer got better at
seeing the direct or indirect effects of integer overflow and clang (and
gcc) take advantage of that.

I have used a slightly different change for about 10 years:

--- kern/kern_clock.c	2014-12-01 15:42:21.707911656 -0800
+++ kern/kern_clock.c	2014-12-01 15:42:21.707911656 -0800
@@ -410,6 +415,11 @@
 #ifdef SW_WATCHDOG
 	EVENTHANDLER_REGISTER(watchdog_list, watchdog_config, NULL, 0);
 #endif
+	/*
+	 * Arrange for ticks to go negative just 5 minutes after boot
+	 * to help catch sign problems sooner.
+	 */
+	ticks = INT_MAX - (hz * 5 * 60);
 }
 
 /*

This came about from when we had problems with integer overflow
arithmetic in the tcp stack.

In any case, I'm in the process of adding -fwrapv and the early
wraparound to the freebsd.org cluster builds to give it some wider
exercise.

-- 
Peter Wemm - pe...@wemm.org; pe...@freebsd.org; pe...@yahoo-inc.com; KI6FJV
UTF-8: for when a ' or ... just won’t do…