Aahh ... so it's measured after reboot. That makes sense. Thanks, Lars. Should I wait for you to log a bugzilla on this?
-----Original Message----- From: [email protected] [mailto:[email protected]] On Behalf Of Lars Ellenberg Sent: Wednesday, July 01, 2009 6:28 AM To: [email protected] Subject: Re: [Linux-HA] [patch] clock_t wrapped around causingfalse resourcestart failure On Tue, Jun 30, 2009 at 10:03:29AM -0400, Tavanyar, Simon wrote: > Hi Dejan, > > The bug looks like a one-off occurrence. We run hundreds of hours of > system stress tests in a week, moving resources between main and standby > systems, and we haven't seen this error in a couple years. (There was a > longclock error back in 2007 found by my colleague Simon Graham). > > The longclock wrap occurred within 2:45 of a reboot. > The apparent coincidence seems to be that we were starting resources on > a back-up node around 165 seconds after the node had been rebooted and > hearbeat restarted. As I expect you know, somewhere between 160 and 175 > seconds after a heartbeat start, the longclock is configured to wrap. Nonono. times() is coupled to uptime, uptime is measured in jiffies, and initial jiffies in linux kernel is include/linux/jiffies.h: #define INITIAL_JIFFIES ((unsigned long)(unsigned int) (-300*HZ)) so the first wrap occurs 5 minutes after boot. nothing to do with heartbeat start, or longclock. > The rareness of this makes me think we hit a really obscure window... yes. but. it is totally broken. some grepping in my mercurial checkouts suggest that you are using heartbeat, which is using its "plumbing" wrappers. [skip this paragraph unless you are heartbeat coder] in particular, the killing is done by TrackedProcTimeoutFunction(), and the timeout for that is set by SetTrackedProcTimeouts(), which calls into Gmain_timeout_add -> Gmain_timeout_add_full (still in the heartbeat plumbing wrappers), which creates a new event source of type GTimoutAppend (also heartbeat plumbing) with methods defined by Gmain_timeout_funcs. The Gmain_timeout_prepare does a simple check on "is the next time this event shall trigger in the past?", and if so, triggers immediately. Now, that check is broken. in lib/clplumbing/GSource.c, Gmain_timeout_prepare(): if (cmp_longclock(lnow, append->nexttime) >= 0) assuming that nexttime was set correctly, and lnow is correct, too, and further assuming your clock_t is only 32 bit, longclock_t is defined as unsigned long long, and that thing becomes: if ((unsigned long long)(lnow) >= (unsigned long long)(append-nexttime)) which exactly does _NOT_ care for wrap around :( cmp_longclock macro and functions are broken. conceivably some other longclock related macros/functions may be broken as well. it should probably be more like (caution, not even compile tested) diff -r 731f8f7b5450 include/clplumbing/longclock.h --- a/include/clplumbing/longclock.h Tue Jun 30 12:02:16 2009 +0200 +++ b/include/clplumbing/longclock.h Wed Jul 01 12:27:27 2009 +0200 @@ -127,9 +127,9 @@ ((longclock_t)(l) - (longclock_t)(r)) # define cmp_longclock(l,r) \ - (((longclock_t)(l) < (longclock_t)(r)) \ + ((((long long)(l) - (long long)(r)) < 0) \ ? -1 \ - : (((longclock_t)(l) > (longclock_t)(r)) \ + : ((((long long)(l) - (long long)(r)) > 0) \ ? +1 : 0)) #endif diff -r 731f8f7b5450 lib/clplumbing/longclock.c --- a/lib/clplumbing/longclock.c Tue Jun 30 12:02:16 2009 +0200 +++ b/lib/clplumbing/longclock.c Wed Jul 01 12:27:27 2009 +0200 @@ -259,12 +259,7 @@ int cmp_longclock(longclock_t l, longclock_t r) { - if (l < r) { - return -1; - } - if (l > r) { - return 1; - } - return 0; + return (((long long)(l) - (long long)(r)) < 0) ? -1 + : (((long long)(l) - (long long)(r)) > 0) ? +1 : 0; } #endif /* CLOCK_T_IS_LONG_ENOUGH */ -- : Lars Ellenberg : LINBIT | Your Way to High Availability : DRBD/HA support and consulting http://www.linbit.com DRBD(r) and LINBIT(r) are registered trademarks of LINBIT, Austria. _______________________________________________ Linux-HA mailing list [email protected] http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems _______________________________________________ Linux-HA mailing list [email protected] http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
