Re: [Linux-HA] [patch] clock_t wrapped around causing false resourcestart failure

Lars Ellenberg Wed, 01 Jul 2009 03:28:50 -0700

On Tue, Jun 30, 2009 at 10:03:29AM -0400, Tavanyar, Simon wrote:
> Hi Dejan,
> 
> The bug looks like a one-off occurrence. We run hundreds of hours of
> system stress tests in a week, moving resources between main and standby
> systems, and we haven't seen this error in a couple years. (There was a
> longclock error back in 2007 found by my colleague Simon Graham).
> 
> The longclock wrap occurred within 2:45 of a reboot. 
> The apparent coincidence seems to be that we were starting resources on
> a back-up node around 165 seconds after the node had been rebooted and
> hearbeat restarted. As I expect you know, somewhere between 160 and 175
> seconds after a heartbeat start, the longclock is configured to wrap.


Nonono.
times() is coupled to uptime,
uptime is measured in jiffies,
and initial jiffies in linux kernel is
include/linux/jiffies.h:
#define INITIAL_JIFFIES ((unsigned long)(unsigned int) (-300*HZ))

so the first wrap occurs 5 minutes after boot.
nothing to do with heartbeat start, or longclock.

> The rareness of this makes me think we hit a really obscure window... 

yes.
but.

it is totally broken.
some grepping in my mercurial checkouts suggest that you are using
heartbeat, which is using its "plumbing" wrappers.

[skip this paragraph unless you are heartbeat coder]
in particular, the killing is done by
TrackedProcTimeoutFunction(),
and the timeout for that is set by SetTrackedProcTimeouts(),
which calls into Gmain_timeout_add -> Gmain_timeout_add_full
(still in the heartbeat plumbing wrappers),
which creates a new event source of type GTimoutAppend
(also heartbeat plumbing) with methods defined by Gmain_timeout_funcs.

The Gmain_timeout_prepare does a simple check on
"is the next time this event shall trigger in the past?",
and if so, triggers immediately.

Now, that check is broken.
in lib/clplumbing/GSource.c, Gmain_timeout_prepare():
        if (cmp_longclock(lnow, append->nexttime) >= 0)
assuming that nexttime was set correctly, and lnow is correct, too,
and further assuming your clock_t is only 32 bit,
longclock_t is defined as unsigned long long,
and that thing becomes:

        if ((unsigned long long)(lnow) >= (unsigned long long)(append-nexttime))

which exactly does _NOT_ care for wrap around :(

cmp_longclock macro and functions are broken.  conceivably some other
longclock related macros/functions may be broken as well.

it should probably be more like
(caution, not even compile tested)

diff -r 731f8f7b5450 include/clplumbing/longclock.h
--- a/include/clplumbing/longclock.h    Tue Jun 30 12:02:16 2009 +0200
+++ b/include/clplumbing/longclock.h    Wed Jul 01 12:27:27 2009 +0200
@@ -127,9 +127,9 @@
        ((longclock_t)(l) - (longclock_t)(r))
 
 #      define  cmp_longclock(l,r)                      \
-       (((longclock_t)(l) < (longclock_t)(r))          \
+       ((((long long)(l) - (long long)(r)) < 0)        \
        ?       -1                                      \
-       : (((longclock_t)(l) > (longclock_t)(r))        \
+       : ((((long long)(l) - (long long)(r)) > 0)      \
        ?       +1 : 0))
 #endif
 
diff -r 731f8f7b5450 lib/clplumbing/longclock.c
--- a/lib/clplumbing/longclock.c        Tue Jun 30 12:02:16 2009 +0200
+++ b/lib/clplumbing/longclock.c        Wed Jul 01 12:27:27 2009 +0200
@@ -259,12 +259,7 @@
 int
 cmp_longclock(longclock_t l, longclock_t r)
 {
-       if (l < r) {
-               return -1;
-       }
-       if (l > r) {
-               return 1;
-       }
-       return 0;
+       return (((long long)(l) - (long long)(r)) < 0) ? -1
+            : (((long long)(l) - (long long)(r)) > 0) ? +1 : 0;
 }
 #endif /* CLOCK_T_IS_LONG_ENOUGH */



-- 
: Lars Ellenberg
: LINBIT | Your Way to High Availability
: DRBD/HA support and consulting http://www.linbit.com

DRBD® and LINBIT® are registered trademarks of LINBIT, Austria.
_______________________________________________
Linux-HA mailing list
[email protected]
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems

Re: [Linux-HA] [patch] clock_t wrapped around causing false resourcestart failure

Reply via email to