On Tue, Jun 30, 2009 at 10:03:29AM -0400, Tavanyar, Simon wrote:
> Hi Dejan,
>
> The bug looks like a one-off occurrence. We run hundreds of hours of
> system stress tests in a week, moving resources between main and standby
> systems, and we haven't seen this error in a couple years. (There was a
> longclock error back in 2007 found by my colleague Simon Graham).
>
> The longclock wrap occurred within 2:45 of a reboot.
> The apparent coincidence seems to be that we were starting resources on
> a back-up node around 165 seconds after the node had been rebooted and
> hearbeat restarted. As I expect you know, somewhere between 160 and 175
> seconds after a heartbeat start, the longclock is configured to wrap.
Nonono.
times() is coupled to uptime,
uptime is measured in jiffies,
and initial jiffies in linux kernel is
include/linux/jiffies.h:
#define INITIAL_JIFFIES ((unsigned long)(unsigned int) (-300*HZ))
so the first wrap occurs 5 minutes after boot.
nothing to do with heartbeat start, or longclock.
> The rareness of this makes me think we hit a really obscure window...
yes.
but.
it is totally broken.
some grepping in my mercurial checkouts suggest that you are using
heartbeat, which is using its "plumbing" wrappers.
[skip this paragraph unless you are heartbeat coder]
in particular, the killing is done by
TrackedProcTimeoutFunction(),
and the timeout for that is set by SetTrackedProcTimeouts(),
which calls into Gmain_timeout_add -> Gmain_timeout_add_full
(still in the heartbeat plumbing wrappers),
which creates a new event source of type GTimoutAppend
(also heartbeat plumbing) with methods defined by Gmain_timeout_funcs.
The Gmain_timeout_prepare does a simple check on
"is the next time this event shall trigger in the past?",
and if so, triggers immediately.
Now, that check is broken.
in lib/clplumbing/GSource.c, Gmain_timeout_prepare():
if (cmp_longclock(lnow, append->nexttime) >= 0)
assuming that nexttime was set correctly, and lnow is correct, too,
and further assuming your clock_t is only 32 bit,
longclock_t is defined as unsigned long long,
and that thing becomes:
if ((unsigned long long)(lnow) >= (unsigned long long)(append-nexttime))
which exactly does _NOT_ care for wrap around :(
cmp_longclock macro and functions are broken. conceivably some other
longclock related macros/functions may be broken as well.
it should probably be more like
(caution, not even compile tested)
diff -r 731f8f7b5450 include/clplumbing/longclock.h
--- a/include/clplumbing/longclock.h Tue Jun 30 12:02:16 2009 +0200
+++ b/include/clplumbing/longclock.h Wed Jul 01 12:27:27 2009 +0200
@@ -127,9 +127,9 @@
((longclock_t)(l) - (longclock_t)(r))
# define cmp_longclock(l,r) \
- (((longclock_t)(l) < (longclock_t)(r)) \
+ ((((long long)(l) - (long long)(r)) < 0) \
? -1 \
- : (((longclock_t)(l) > (longclock_t)(r)) \
+ : ((((long long)(l) - (long long)(r)) > 0) \
? +1 : 0))
#endif
diff -r 731f8f7b5450 lib/clplumbing/longclock.c
--- a/lib/clplumbing/longclock.c Tue Jun 30 12:02:16 2009 +0200
+++ b/lib/clplumbing/longclock.c Wed Jul 01 12:27:27 2009 +0200
@@ -259,12 +259,7 @@
int
cmp_longclock(longclock_t l, longclock_t r)
{
- if (l < r) {
- return -1;
- }
- if (l > r) {
- return 1;
- }
- return 0;
+ return (((long long)(l) - (long long)(r)) < 0) ? -1
+ : (((long long)(l) - (long long)(r)) > 0) ? +1 : 0;
}
#endif /* CLOCK_T_IS_LONG_ENOUGH */
--
: Lars Ellenberg
: LINBIT | Your Way to High Availability
: DRBD/HA support and consulting http://www.linbit.com
DRBD® and LINBIT® are registered trademarks of LINBIT, Austria.
_______________________________________________
Linux-HA mailing list
[email protected]
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems