[Linux-ha-dev] heartbeat fails every 497 days on 32 bit linux

Wolfgang Dumhs Mon, 14 Aug 2006 09:53:28 -0700

Alan Robertson wrote:

Wolfgang Dumhs wrote:
> Hi,
>
> im using heartbeat 1.2.3 on linux servers with kernel 2.6.5 on a couple
> of systems and some days ago again a machine had problems after an
> uptime of 497 days:
>
> Jul  6 14:20:01 dmask2 heartbeat[13308]: WARN: node dmask1: is dead
> Jul  6 14:20:01 dmask2 heartbeat[13308]: info: Dead node dmask1 gave up 
resources.
> Jul  6 14:20:01 dmask2 heartbeat[13308]: WARN: node dmask2: is dead
> Jul  6 14:20:01 dmask2 heartbeat[13308]: ERROR: No local heartbeat. Forcing 
restart.
> Jul  6 14:20:01 dmask2 heartbeat[13308]: info: Heartbeat shutdown in 
progress. (13308)
> Jul  6 14:20:01 dmask2 heartbeat[13308]: WARN: node routers: is dead
> Jul  6 14:20:01 dmask2 heartbeat[13308]: info: Resource takeover cancelled - 
shutdown in progress.
> Jul  6 14:20:01 dmask2 heartbeat[13308]: info: Link dmask1:eth1 dead.
> Jul  6 14:20:01 dmask2 heartbeat[13308]: info: Link dmask1:eth0 dead.
> Jul  6 14:20:01 dmask2 heartbeat[13308]: info: Link routers:routers dead.
> Jul  6 14:20:01 dmask2 heartbeat[13308]: WARN: Late heartbeat: Node dmask2: 
interval 41070 ms
> ...
>
> I have found some posts with similar problems, but no solution, so I
> began to search, where the failure comes from. And I think I found it:
> It is a design problem of the system call times(), which ist used in
> longclock.c. Under 32 bit linux, system calls are not designed to return
> values between -4095 and -1, because this range is reserved for
> error-codes. To check this behaviour, I patched the kernel to begin
> returning small negativ numbers after system start:
>
> #define INITIAL_JIFFIES ((unsigned long)(0x100000000*HZ/USER_HZ - 600*HZ))


This doesn´t work an 32 bit systems, the rigth statement is:

#define INITIAL_JIFFIES ((unsigned long long)(0x100000000*HZ/USER_HZ - 600*HZ))

and has to be set in /usr/src/linux/include/linux/time.h
or in /usr/src/linux/include/jiffies.h, depending on kernel version.

>
> Then I wrote a little program which ouputs the return of times(), and
> the errno-Variable, and the ouput is:
>
> Fri Jul 14 13:54:47 2006: times() returns: -4380, errno=0
> Fri Jul 14 13:54:48 2006: times() returns: -4280, errno=0
> Fri Jul 14 13:54:49 2006: times() returns: -4180, errno=0
> Fri Jul 14 13:54:50 2006: times() returns: -1, errno=4080
> Fri Jul 14 13:54:51 2006: times() returns: -1, errno=3980
> Fri Jul 14 13:54:52 2006: times() returns: -1, errno=3879
> Fri Jul 14 13:54:53 2006: times() returns: -1, errno=3779
> ...
> Fri Jul 14 13:55:27 2006: times() returns: -1, errno=372
> Fri Jul 14 13:55:28 2006: times() returns: -1, errno=272
> Fri Jul 14 13:55:29 2006: times() returns: -1, errno=171
> Fri Jul 14 13:55:30 2006: times() returns: -1, errno=71
> Fri Jul 14 13:55:31 2006: times() returns: 29, errno=71
> Fri Jul 14 13:55:32 2006: times() returns: 129, errno=71
> Fri Jul 14 13:55:33 2006: times() returns: 230, errno=71
> Fri Jul 14 13:55:34 2006: times() returns: 330, errno=71
>
> So this explains the late heartbeat message with 41 seconds in the
> log-file above.
>
> I see two solutions: If one has a deadtime of more than 45 seconds,
> there is just a late heartbeat warning on both nodes but nothing else
> happens. The second solution is a patch in longclock.c:
>
> --- lib/clplumbing/longclock.c.orig  2004-08-16 02:07:41.000000000 +0200
> +++ lib/clplumbing/longclock.c  2006-07-10 19:19:40.000000000 +0200
> @@ -24,6 +24,7 @@
>  #include <portability.h>
>  #include <unistd.h>
>  #include <sys/times.h>
> +#include <errno.h>
>  #include <clplumbing/longclock.h>
>
>  #ifndef CLOCK_T_IS_LONG_ENOUGH
> @@ -87,6 +88,10 @@
>
>         /* times really returns an unsigned value ... */
>         timesval = (unsigned long) times(&longclock_dummy_tms_struct);
> +       /* due to design problem of times() ... terribly uggly! */
> +       if (timesval == (unsigned long)-1) {
> +               timesval = -errno;
> +       }
>
>         if (!lasttimes) {
>                 lasttimes = timesval;
>
> With this patch, no problems occured during a wrap of times().

It's fixed in 1.2.4 If I recall correctly...


I have not testet 1.2.4, but 2.0.6, and there the problem still exists, as one
can see in the logfile from a test-system with a INITIAL_JIFFIES-patched kernel:

Jul 25 21:00:02 fwz0000 heartbeat: [1142]: WARN: Gmain_timeout_dispatch: 
Dispatch function for send local status was delayed 40140 ms before being 
called (GSource: 0x80fb178)
Jul 25 21:00:02 fwz0000 heartbeat: [1142]: info: Gmain_timeout_dispatch: 
started at 4294967295 should have started at 4294963281
Jul 25 21:00:02 fwz0000 heartbeat: [1142]: WARN: Late heartbeat: Node 
fwz0000r1: interval 41140 ms
Jul 25 21:00:02 fwz0000 heartbeat: [1142]: WARN: node server: is dead
Jul 25 21:00:02 fwz0000 heartbeat: [1142]: info: Dead node server gave up 
resources.
Jul 25 21:00:02 fwz0000 heartbeat: [1142]: info: Link server:eth0 dead.
Jul 25 21:00:02 fwz0000 heartbeat: [1142]: WARN: Gmain_timeout_dispatch: 
Dispatch function for check for signals was delayed 40780 ms before being 
called (GSource: 0x80fb480)
Jul 25 21:00:02 fwz0000 heartbeat: [1142]: info: Gmain_timeout_dispatch: 
started at 4294967295 should have started at 4294963217
Jul 25 21:00:02 fwz0000 heartbeat: [1142]: WARN: Gmain_timeout_dispatch: 
Dispatch function for update msgfree count was delayed 40820 ms before being 
called (GSource: 0x80fb550)
Jul 25 21:00:02 fwz0000 heartbeat: [1142]: info: Gmain_timeout_dispatch: 
started at 4294967295 should have started at 4294963213
Jul 25 21:00:02 fwz0000 heartbeat: [1142]: WARN: Gmain_timeout_dispatch: 
Dispatch function for client audit was delayed 32550 ms before being called 
(GSource: 0x80fb3b0)
Jul 25 21:00:02 fwz0000 heartbeat: [1142]: info: Gmain_timeout_dispatch: 
started at 4294967295 should have started at 4294964040
Jul 25 21:00:02 fwz0000 heartbeat: [1142]: CRIT: Cluster node server returning 
after partition.
Jul 25 21:00:02 fwz0000 heartbeat: [1142]: info: For information on cluster 
partitions, See URL: http://linux-ha.org/SplitBrain
Jul 25 21:00:02 fwz0000 heartbeat: [1142]: WARN: Deadtime value may be too 
small.
Jul 25 21:00:02 fwz0000 heartbeat: [1142]: info: See FAQ for information on 
tuning deadtime.
Jul 25 21:00:02 fwz0000 heartbeat: [1142]: info: URL: 
http://linux-ha.org/FAQ#heavy_load
Jul 25 21:00:02 fwz0000 heartbeat: [1142]: info: Link server:eth0 up.
Jul 25 21:00:02 fwz0000 heartbeat: [1142]: WARN: Late heartbeat: Node server: 
interval 41320 ms
Jul 25 21:00:02 fwz0000 heartbeat: [1142]: info: Status update for node server: 
status active
Jul 25 21:00:02 fwz0000 heartbeat: [2104]: debug: notify_world: setting SIGCHLD 
Handler to SIG_DFL
Jul 25 21:00:02 fwz0000 harc[2104]: info: Running /etc/ha.d/rc.d/status status
Jul 25 21:00:33 fwz0000 heartbeat: [1142]: ERROR: Both machines own our 
resources!
Jul 25 21:00:33 fwz0000 heartbeat: [1142]: ERROR: Both machines own foreign 
resources!
Jul 25 21:00:34 fwz0000 heartbeat: [1142]: ERROR: Both machines own our 
resources!
Jul 25 21:00:34 fwz0000 heartbeat: [1142]: ERROR: Both machines own foreign 
resources!
Jul 25 21:00:43 fwz0000 heartbeat: [1145]: info: time_longclock: clock_t 
wrapped around (uptime).
Jul 25 21:00:43 fwz0000 logd: [1011]: info: time_longclock: clock_t wrapped 
around (uptime).
Jul 25 21:00:43 fwz0000 logd: [1020]: info: time_longclock: clock_t wrapped 
around (uptime).
Jul 25 21:00:43 fwz0000 heartbeat: [1142]: info: time_longclock: clock_t 
wrapped around (uptime).
Jul 25 21:00:43 fwz0000 heartbeat: [1147]: info: time_longclock: clock_t 
wrapped around (uptime).
Jul 25 21:00:44 fwz0000 heartbeat: [1146]: info: time_longclock: clock_t 
wrapped around (uptime).
Jul 25 21:00:45 fwz0000 heartbeat: [1142]: info: Heartbeat shutdown in 
progress. (1142)
Jul 25 21:00:45 fwz0000 heartbeat: [2114]: info: Giving up all HA resources.
...

Again, there are late heartbeat messages with a little more than 40 seconds.

With the following patch, nothing strange happens during the wrap of times(), 
just some messages with
info: time_longclock: clock_t wrapped around (uptime).:

--- lib/clplumbing/longclock.c.orig     2006-07-25 17:24:16.000000000 +0200
+++ lib/clplumbing/longclock.c  2006-07-25 17:36:01.000000000 +0200
@@ -24,6 +24,7 @@
 #include <portability.h>
 #include <unistd.h>
 #include <sys/times.h>
+#include <errno.h>
 #include <clplumbing/longclock.h>
 #include <clplumbing/cl_log.h>

@@ -99,8 +100,18 @@
         * And, unfortunately, it is quite possible for the correct return
         * from times(2) to be exactly (clock_t)-1.  Sigh...
         *
+        * Due to design problems of times(2) we get the error value (-1)
+        * when a value between -4096 and -1 should be returned, because
+        * that range is reserved for error codes. Therefore, the real value
+        * is stored in the errno-varibale. To be safe if someday that error
+        * of libc is fixed, we set errno to 1 before the system call, and in
+        * case of a correct return value of -1 nothing bad should happen...
         */
+       errno = 1;
        timesval = (unsigned long) times(&longclock_dummy_tms_struct);
+       if (timesval == (unsigned long)-1) {
+               timesval = -errno;
+       }

        if (calledbefore && timesval < lasttimes)  {
                clock_t         jumpbackby = lasttimes - timesval;

--
Wolfgang Dumhs


_______________________________________________________
Linux-HA-Dev: [email protected]
http://lists.linux-ha.org/mailman/listinfo/linux-ha-dev
Home Page: http://linux-ha.org/

[Linux-ha-dev] heartbeat fails every 497 days on 32 bit linux

Reply via email to