[Linux-ha-dev] heartbeat fails every 497 days on 32 bit linux

Wolfgang Dumhs Fri, 14 Jul 2006 20:03:32 -0700

Hi,

I'm using heartbeat 1.2.3 on Linux servers with kernel 2.6.5 on a couple of systems, and a few days ago a machine again had problems after an uptime of 497 days:


Jul  6 14:20:01 dmask2 heartbeat[13308]: WARN: node dmask1: is dead
Jul  6 14:20:01 dmask2 heartbeat[13308]: info: Dead node dmask1 gave up resources.
Jul  6 14:20:01 dmask2 heartbeat[13308]: WARN: node dmask2: is dead
Jul  6 14:20:01 dmask2 heartbeat[13308]: ERROR: No local heartbeat. Forcing restart.
Jul  6 14:20:01 dmask2 heartbeat[13308]: info: Heartbeat shutdown in progress. (13308)
Jul  6 14:20:01 dmask2 heartbeat[13308]: WARN: node routers: is dead
Jul  6 14:20:01 dmask2 heartbeat[13308]: info: Resource takeover cancelled - shutdown in progress.
Jul  6 14:20:01 dmask2 heartbeat[13308]: info: Link dmask1:eth1 dead.
Jul  6 14:20:01 dmask2 heartbeat[13308]: info: Link dmask1:eth0 dead.
Jul  6 14:20:01 dmask2 heartbeat[13308]: info: Link routers:routers dead.
Jul  6 14:20:01 dmask2 heartbeat[13308]: WARN: Late heartbeat: Node dmask2: interval 41070 ms

...

I have found some posts about similar problems, but no solution, so I began to search for where the failure comes from. And I think I found it: it is a design problem of the system call times(), which is used in longclock.c. Under 32-bit Linux, system calls cannot return values between -4095 and -1, because this range is reserved for error codes. To check this behaviour, I patched the kernel to begin returning small negative numbers shortly after system start:


#define INITIAL_JIFFIES ((unsigned long)(0x100000000*HZ/USER_HZ - 600*HZ))
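The error-code convention described above can be sketched in user space. This is only an illustration, not the actual libc or kernel code: the helper name wrap_syscall_result and the constant SYSCALL_MAX_ERRNO are hypothetical, and the sample values are taken from the program output shown further down.

```c
#include <stdio.h>

/* Assumption: raw kernel return values in [-4095, -1] are treated as errors. */
#define SYSCALL_MAX_ERRNO 4095L

/*
 * Hypothetical helper: models what a syscall wrapper does with the raw
 * value handed back by the kernel. A value in the reserved range is
 * reported as -1, and its negation is stored in *err (standing in for
 * errno). Any other value is passed through and *err is left untouched.
 */
static long wrap_syscall_result(long raw, int *err)
{
    if (raw >= -SYSCALL_MAX_ERRNO && raw <= -1) {
        *err = (int)-raw;
        return -1;
    }
    return raw;
}
```

Feeding it a raw tick count of -4380 returns -4380 unchanged, while -4080 comes back as -1 with the stored error set to 4080, which is the same pattern the real times() shows once the counter enters the reserved range.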

Then I wrote a little program which outputs the return value of times() and the errno variable. The output is:


Fri Jul 14 13:54:47 2006: times() returns: -4380, errno=0
Fri Jul 14 13:54:48 2006: times() returns: -4280, errno=0
Fri Jul 14 13:54:49 2006: times() returns: -4180, errno=0
Fri Jul 14 13:54:50 2006: times() returns: -1, errno=4080
Fri Jul 14 13:54:51 2006: times() returns: -1, errno=3980
Fri Jul 14 13:54:52 2006: times() returns: -1, errno=3879
Fri Jul 14 13:54:53 2006: times() returns: -1, errno=3779
...
Fri Jul 14 13:55:27 2006: times() returns: -1, errno=372
Fri Jul 14 13:55:28 2006: times() returns: -1, errno=272
Fri Jul 14 13:55:29 2006: times() returns: -1, errno=171
Fri Jul 14 13:55:30 2006: times() returns: -1, errno=71
Fri Jul 14 13:55:31 2006: times() returns: 29, errno=71
Fri Jul 14 13:55:32 2006: times() returns: 129, errno=71
Fri Jul 14 13:55:33 2006: times() returns: 230, errno=71
Fri Jul 14 13:55:34 2006: times() returns: 330, errno=71
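A probe along these lines could produce output in the format above. It is a sketch, not the original program: the function name print_times_sample is hypothetical, and unlike the original (where errno stays at 71 after the wrap) this version resets errno before every call.

```c
#include <errno.h>
#include <stdio.h>
#include <sys/times.h>
#include <time.h>

/*
 * Hypothetical probe: writes one line with a timestamp, the return value
 * of times() and the errno value observed right after the call.
 * Returns the number of characters written, or a negative value if the
 * write fails.
 */
static int print_times_sample(FILE *out)
{
    struct tms buf;
    char stamp[32];
    time_t now = time(NULL);
    long ret;
    int err;

    strftime(stamp, sizeof stamp, "%a %b %e %H:%M:%S %Y", localtime(&now));

    errno = 0;                /* assumption: start each sample with a clean errno */
    ret = (long)times(&buf);
    err = errno;              /* save it before fprintf() can modify errno */

    return fprintf(out, "%s: times() returns: %ld, errno=%d\n", stamp, ret, err);
}
```

Calling print_times_sample(stdout) once per second from a small main() loop gives one line per second, as in the listing above.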

So this explains the late heartbeat message with 41 seconds in the log file above.
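The numbers can be checked with a little arithmetic. The sketch below assumes USER_HZ is 100, which matches the roughly 100 ticks per second visible in the output above; the names ASSUMED_USER_HZ, ERRNO_RANGE_TICKS, blind_window_ms and wrap_period_days are made up for illustration.

```c
#include <stdio.h>

/* Assumption: 100 clock ticks per second, as the output above suggests. */
#define ASSUMED_USER_HZ   100UL
/* Number of tick values (-4095 .. -1) that are reported as errors. */
#define ERRNO_RANGE_TICKS 4095UL

/* Milliseconds during which times() reports -1 instead of a tick count. */
static unsigned long blind_window_ms(void)
{
    return ERRNO_RANGE_TICKS * 1000UL / ASSUMED_USER_HZ;
}

/* Whole days until a 32-bit tick counter wraps around. */
static unsigned long wrap_period_days(void)
{
    unsigned long long ticks = 1ULL << 32;
    return (unsigned long)(ticks / ASSUMED_USER_HZ / 86400ULL);
}
```

Under that assumption the window is 40950 ms, close to the 41070 ms interval in the log, and the wrap period is 497 whole days, which matches the uptime at which the problem appeared.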

I see two solutions. If one has a deadtime of more than 45 seconds, there is just a late heartbeat warning on both nodes, but nothing else happens. The second solution is a patch in longclock.c:


--- lib/clplumbing/longclock.c.orig  2004-08-16 02:07:41.000000000 +0200
+++ lib/clplumbing/longclock.c  2006-07-10 19:19:40.000000000 +0200
@@ -24,6 +24,7 @@
 #include <portability.h>
 #include <unistd.h>
 #include <sys/times.h>
+#include <errno.h>
 #include <clplumbing/longclock.h>

 #ifndef CLOCK_T_IS_LONG_ENOUGH
@@ -87,6 +88,10 @@

        /* times really returns an unsigned value ... */
        timesval = (unsigned long) times(&longclock_dummy_tms_struct);
+       /* due to design problem of times() ... terribly uggly! */
+       if (timesval == (unsigned long)-1) {
+               timesval = -errno;
+       }

        if (!lasttimes) {
                lasttimes = timesval;

With this patch, no problems occurred during a wrap of times().
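For experimenting outside of heartbeat, the idea of the patch can be isolated in a small helper. This is a sketch under assumptions, not the code in longclock.c: the names recover_ticks and read_ticks are hypothetical, and the extra check for a zero errno is an addition that the patch itself does not have.

```c
#include <errno.h>
#include <sys/times.h>

/*
 * Hypothetical helper: if times() reported -1 and an errno value was set,
 * treat the negated errno as the real tick count, as the patch does.
 * Assumption: with a valid buffer times() does not fail for real, so an
 * errno value seen here is the misreported tick count.
 */
static unsigned long recover_ticks(unsigned long ret, int err)
{
    if (ret == (unsigned long)-1 && err != 0) {
        return (unsigned long)-(long)err;
    }
    return ret;
}

/* Reads the tick counter with the workaround applied. */
static unsigned long read_ticks(void)
{
    struct tms dummy;
    unsigned long ret;

    errno = 0;   /* so that a stale errno is not mistaken for a tick count */
    ret = (unsigned long)times(&dummy);
    return recover_ticks(ret, errno);
}
```

With the values from the output above, recover_ticks((unsigned long)-1, 4080) yields the same bit pattern as (unsigned long)-4080, so the tick sequence continues without a gap through the wrap.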

--
Wolfgang Dumhs
_______________________________________________________
Linux-HA-Dev: [email protected]
http://lists.linux-ha.org/mailman/listinfo/linux-ha-dev
Home Page: http://linux-ha.org/
