Wolfgang Dumhs wrote:
> Alan Robertson wrote:
>> Wolfgang Dumhs wrote:
>>> Alan Robertson wrote:
>>>> Wolfgang Dumhs wrote:
>>>>> Hi,
>>>>>
>>>>> im using heartbeat 1.2.3 on linux servers with kernel 2.6.5 on a
>>>>> couple
>>>>> of systems and some days ago again a machine had problems after an
>>>>> uptime of 497 days:
>>>>>
>>>>> Jul  6 14:20:01 dmask2 heartbeat[13308]: WARN: node dmask1: is dead
>>>>> Jul  6 14:20:01 dmask2 heartbeat[13308]: info: Dead node dmask1 gave
>>>> up resources.
>>>>> Jul  6 14:20:01 dmask2 heartbeat[13308]: WARN: node dmask2: is dead
>>>>> Jul  6 14:20:01 dmask2 heartbeat[13308]: ERROR: No local heartbeat.
>>>> Forcing restart.
>>>>> Jul  6 14:20:01 dmask2 heartbeat[13308]: info: Heartbeat shutdown in
>>>> progress. (13308)
>>>>> Jul  6 14:20:01 dmask2 heartbeat[13308]: WARN: node routers: is dead
>>>>> Jul  6 14:20:01 dmask2 heartbeat[13308]: info: Resource takeover
>>>> cancelled - shutdown in progress.
>>>>> Jul  6 14:20:01 dmask2 heartbeat[13308]: info: Link dmask1:eth1 dead.
>>>>> Jul  6 14:20:01 dmask2 heartbeat[13308]: info: Link dmask1:eth0 dead.
>>>>> Jul  6 14:20:01 dmask2 heartbeat[13308]: info: Link routers:routers
>>>> dead.
>>>>> Jul  6 14:20:01 dmask2 heartbeat[13308]: WARN: Late heartbeat: Node
>>>> dmask2: interval 41070 ms
>>>>> ...
>>>>>
>>>>> I have found some posts with similar problems, but no solution, so I
>>>>> began to search, where the failure comes from. And I think I found it:
>>>>> It is a design problem of the system call times(), which ist used in
>>>>> longclock.c. Under 32 bit linux, system calls are not designed to
>>>> return
>>>>> values between -4095 and -1, because this range is reserved for
>>>>> error-codes. To check this behaviour, I patched the kernel to begin
>>>>> returning small negativ numbers after system start:
>>>>>
>>>>> #define INITIAL_JIFFIES ((unsigned long)(0x100000000*HZ/USER_HZ -
>>>> 600*HZ))
>>> This doesn´t work an 32 bit systems, the rigth statement is:
>>>
>>> #define INITIAL_JIFFIES ((unsigned long long)(0x100000000*HZ/USER_HZ -
>>> 600*HZ))
>>>
>>> and has to be set in /usr/src/linux/include/linux/time.h
>>> or in /usr/src/linux/include/jiffies.h, depending on kernel version.
>>>
>>>>> Then I wrote a little program which ouputs the return of times(), and
>>>>> the errno-Variable, and the ouput is:
>>>>>
>>>>> Fri Jul 14 13:54:47 2006: times() returns: -4380, errno=0
>>>>> Fri Jul 14 13:54:48 2006: times() returns: -4280, errno=0
>>>>> Fri Jul 14 13:54:49 2006: times() returns: -4180, errno=0
>>>>> Fri Jul 14 13:54:50 2006: times() returns: -1, errno=4080
>>>>> Fri Jul 14 13:54:51 2006: times() returns: -1, errno=3980
>>>>> Fri Jul 14 13:54:52 2006: times() returns: -1, errno=3879
>>>>> Fri Jul 14 13:54:53 2006: times() returns: -1, errno=3779
>>>>> ...
>>>>> Fri Jul 14 13:55:27 2006: times() returns: -1, errno=372
>>>>> Fri Jul 14 13:55:28 2006: times() returns: -1, errno=272
>>>>> Fri Jul 14 13:55:29 2006: times() returns: -1, errno=171
>>>>> Fri Jul 14 13:55:30 2006: times() returns: -1, errno=71
>>>>> Fri Jul 14 13:55:31 2006: times() returns: 29, errno=71
>>>>> Fri Jul 14 13:55:32 2006: times() returns: 129, errno=71
>>>>> Fri Jul 14 13:55:33 2006: times() returns: 230, errno=71
>>>>> Fri Jul 14 13:55:34 2006: times() returns: 330, errno=71
>>
>>
>> That is _SOOO_ broken.  You need to get your kernel or library fixed.
>> It is COMPLETELY contradictory to the times(2) man page - and any kind
>> of rational system behavior.
>>
>> I've never heard of this.
> 
> I agree that this is broken, and it seems to be a libc-bug. I have tried
> with
> SuSE Linux 9.1 / Kernel 2.6.5 / glibc 2.3.3 and SuSE Linux 9.3 / Kernel
> 2.6.13
> / glibc 2.3.4 and the behaviour is the same.
> 
> The reason for this behaviour seems to be how return values of system calls
> are treeted by the library: small negativ numbers - in the range -1 to
> -4096
> - are supposed to be error-codes and thus are returned as -1 while the
> errno-variable is set to the negative return value. One can see this
> overall in
> the kernel where system calls do something like "return -EINVAL" or
> "return -ENOSYS":
> The wrapper for system calls in the library translates this to a return
> value of -1
> and sets the errno-variable to EINVAL or ENOSYS.
> 
> And the same happens to small negativ numbers during the system call
> times(2), which
> seems to have the standard system call wrapper in the library.
> 
> As I don´t know how one could fix that in the library and even if it
> were fixed
> not all people using heartbeat would update their glibc, I thought it
> would be a
> good idea to bypass this bug in the heartbeat-code with the patch I
> included in
> my former post.

OK.

There are several places in the code where we use times(2).  I've taken
your ideas and created a new patch for the 2.x CVS(HEAD) branch which I
think covers them all.

I've attached it here.

Wolfgang:  Could you test this and see if it works during wraparound?

It's not quite the same as yours, but avoids casting -1 to an unsigned.
 I think the effect should be the same.

Please let me/us all know.

-- 
    Alan Robertson <[EMAIL PROTECTED]>

"Openness is the foundation and preservative of friendship...  Let me
claim from you at all times your undisguised opinions." - William
Wilberforce
Index: include/clplumbing/longclock.h
===================================================================
RCS file: /home/cvs/linux-ha/linux-ha/include/clplumbing/longclock.h,v
retrieving revision 1.9
diff -u -r1.9 longclock.h
--- include/clplumbing/longclock.h      22 Jun 2005 14:01:34 -0000      1.9
+++ include/clplumbing/longclock.h      15 Aug 2006 21:35:46 -0000
@@ -39,6 +39,10 @@
  *
  *     The functions provided here are:
  *
+ *     unsigned long   cl_times(void);
+ *                     A rational wrapper for the times(2) call
+ *                     for those cases where only the return value
+ *                     is wanted.
  *     longclock_t     time_longclock(void);
  *                     Returns current time as a longclock_t.
  *
@@ -77,6 +81,7 @@
  *
  *     extern const longclock_t        zero_longclock;
  */
+extern unsigned long cl_times(void);
 
 #ifdef CLOCK_T_IS_LONG_ENOUGH
 #      ifndef  HAVE_LONGCLOCK_ARITHMETIC
Index: lib/clplumbing/GSource.c
===================================================================
RCS file: /home/cvs/linux-ha/linux-ha/lib/clplumbing/GSource.c,v
retrieving revision 1.84
diff -u -r1.84 GSource.c
--- lib/clplumbing/GSource.c    17 Jul 2006 16:56:47 -0000      1.84
+++ lib/clplumbing/GSource.c    15 Aug 2006 21:35:46 -0000
@@ -970,7 +970,6 @@
        *timeoutms = 1000;      /* Sigh... */
 
        if (sig_src->signal_triggered) {
-               static struct tms       dummy_tms_struct;
                clock_t                 now;
                clock_t                 diff;
 
@@ -980,7 +979,7 @@
                        return TRUE;
                }
                /* Otherwise, this is when it was first detected */
-               now = times(&dummy_tms_struct);
+               now = cl_times();
                diff = now - sig_src->sh_detecttime;    /* How long since 
signal occurred? */
                lc_store(
                        sig_src->detecttime,
@@ -1004,14 +1003,13 @@
        g_assert(IS_SIGSOURCE(sig_src));
        
        if (sig_src->signal_triggered) {
-               static struct tms       dummy_tms_struct;
                clock_t                 now;
                clock_t                 diff;
                if (cmp_longclock(lc_fetch(sig_src->detecttime), 
zero_longclock) != 0){
                        return TRUE;
                }
                /* Otherwise, this is when it was first detected */
-               now = times(&dummy_tms_struct);
+               now = cl_times();
                diff = now - sig_src->sh_detecttime;
                lc_store(
                        sig_src->detecttime,
@@ -1072,7 +1070,6 @@
 static void
 G_main_signal_handler(int nsig)
 {
-       static struct tms       dummy_tms_struct;
        GSIGSource* sig_src = NULL;
 
        if(G_main_signal_list == NULL) {
@@ -1095,7 +1092,7 @@
        /* Time from first occurance of signal */
        if (!sig_src->signal_triggered) {
                /* Avoid calling longclock_time() on a signal */
-               sig_src->sh_detecttime=times(&dummy_tms_struct);
+               sig_src->sh_detecttime=cl_times();
        }
        sig_src->signal_triggered = TRUE;
 }
Index: lib/clplumbing/cl_random.c
===================================================================
RCS file: /home/cvs/linux-ha/linux-ha/lib/clplumbing/cl_random.c,v
retrieving revision 1.4
diff -u -r1.4 cl_random.c
--- lib/clplumbing/cl_random.c  15 Aug 2006 17:54:46 -0000      1.4
+++ lib/clplumbing/cl_random.c  15 Aug 2006 21:35:46 -0000
@@ -136,7 +136,6 @@
        char buf[16];
        FILE* fs;
        struct timeval tv;
-       static struct tms       dummy_tms_struct;
        const char * randdevname [] = {"/dev/urandom", "/dev/random"};
        int                     idev;
 #if 0
@@ -217,14 +216,8 @@
        /*
         * times(2) returns the number of clock ticks since
         * boot.  Fairly predictable, but not completely so...
-        *
-        * The man page claims times(2) can fail, but the
-        * correct return result also might include -1 so
-        * in practice it's impossible to tell, and since
-        * dummy_tms_struct can't be invalid there is no
-        * known reason why it should fail.
         */
-       return (unsigned int) times(&dummy_tms_struct);
+       return (unsigned int) cl_times();
 
 
 #if 0
Index: lib/clplumbing/longclock.c
===================================================================
RCS file: /home/cvs/linux-ha/linux-ha/lib/clplumbing/longclock.c,v
retrieving revision 1.20
diff -u -r1.20 longclock.c
--- lib/clplumbing/longclock.c  6 Feb 2006 13:53:53 -0000       1.20
+++ lib/clplumbing/longclock.c  15 Aug 2006 21:35:46 -0000
@@ -24,6 +24,7 @@
 #include <portability.h>
 #include <unistd.h>
 #include <sys/times.h>
+#include <errno.h>
 #include <clplumbing/longclock.h>
 #include <clplumbing/cl_log.h>
 
@@ -61,14 +62,59 @@
        return Hz;
 }
 
-static struct tms      longclock_dummy_tms_struct;
+#ifdef TIMES_ALLOWS_NULL_PARAM
+#      define  TIMES_PARAM     NULL
+#else
+       static struct tms       dummy_longclock_tms_struct;
+#      define  TIMES_PARAM     &dummy_longclock_tms_struct
+#endif
+
+unsigned long
+cl_times(void) /* Make times(2) behave rationally on Linux */
+{
+       int             save_errno = errno;
+       clock_t ret;
+
+       /*
+        * times(2) really returns an unsigned value ...
+        *
+        * We don't check to see if we got back the error value (-1), because
+        * the only possibility for an error would be if the address of 
+        * longclock_dummy_tms_struct was invalid.  Since it's a
+        * compiler-generated address, we assume that errors are impossible.
+        * And, unfortunately, it is quite possible for the correct return
+        * from times(2) to be exactly (clock_t)-1.  Sigh...
+        *
+        */
+       errno   = 0;
+       ret     = times(TIMES_PARAM);
+
+/*
+ *     This is to work around a bug in the system call interface
+ *     for times(2) found in glibc on Linux (and maybe elsewhere)
+ *     It changes the return values from -1 to -4096 all into
+ *     -1 and then dumps the -(return value) into errno.
+ *
+ *     This totally bizarre behavior seems to be widespread in
+ *     versions of Linux and glibc.
+ *
+ *     Many thanks to Wolfgang Dumhs <wolfgang.dumhs (at) gmx.at>
+ *     for finding and documenting this bizarre behavior.
+ */
+       if (errno != 0) {
+               ret = (clock_t) (-errno);
+       }
+       errno = save_errno;
+       return (unsigned long)ret;
+}
+
 #ifdef CLOCK_T_IS_LONG_ENOUGH
 longclock_t
 time_longclock(void)
 {
 
        /* See note below about deliberately ignoring errors... */
-       return (longclock_t)times(&longclock_dummy_tms_struct);
+       return (longclock_t)cl_times();
 }
 
 #else  /* clock_t is shorter than 64 bits */
@@ -89,18 +135,8 @@
        unsigned long           timesval;
 
        ++callcount;
-       /*
-        * times(2) really returns an unsigned value ...
-        *
-        * We don't check to see if we got back the error value (-1), because
-        * the only possibility for an error would be if the address of 
-        * longclock_dummy_tms_struct was invalid.  Since it's a
-        * compiler-generated address, we assume that errors are impossible.
-        * And, unfortunately, it is quite possible for the correct return
-        * from times(2) to be exactly (clock_t)-1.  Sigh...
-        *
-        */
-       timesval = (unsigned long) times(&longclock_dummy_tms_struct);
+
+       timesval = (unsigned long) cl_times();
 
        if (calledbefore && timesval < lasttimes)  {
                clock_t         jumpbackby = lasttimes - timesval;
Index: lib/stonith/expect.c
===================================================================
RCS file: /home/cvs/linux-ha/linux-ha/lib/stonith/expect.c,v
retrieving revision 1.21
diff -u -r1.21 expect.c
--- lib/stonith/expect.c        11 May 2005 11:49:20 -0000      1.21
+++ lib/stonith/expect.c        15 Aug 2006 21:35:47 -0000
@@ -44,6 +44,7 @@
 #include <netinet/in.h>
 #include <arpa/inet.h>
 #include <stonith/st_ttylock.h>
+#include <clplumbing/longclock.h>
 #define ENABLE_PIL_DEFS_PRIVATE
 #include <pils/plugin.h>
 
@@ -65,6 +66,32 @@
 #define STRDUP         StonithPIsys->imports->mstrdup
 #define FREE(p)               {StonithPIsys->imports->mfree(p); (p) = NULL;}
 
+#ifdef TIMES_ALLOWS_NULL_PARAM
+#      define  TIMES_PARAM     NULL
+#else
+       static struct tms       dummy_longclock_tms_struct;
+#      define  TIMES_PARAM     &dummy_longclock_tms_struct
+#endif
+
+static unsigned long
+our_times(void)        /* Make times(2) behave rationally on Linux */
+{
+       int             save_errno;
+       clock_t ret;
+
+       /*
+        * This code copied from clplumbing/longclock.c to avoid
+        * making STONITH depend on clplumbing.  See it for an explanation
+        */
+       errno   = 0;
+       ret     = times(TIMES_PARAM);
+
+       if (errno != 0) {
+               ret = (clock_t) (-errno);
+       }
+       errno = save_errno;
+       return (unsigned long)ret;
+}
 /*
  *     Look for ('expect') any of a series of tokens in the input
  *     Return the token type for the given token or -1 on error.
@@ -74,11 +101,6 @@
 ExpectToken(int        fd, struct Etoken * toklist, int to_secs, char * savebuf
 ,      int maxline, int Debug)
 {
-       /*
-        * We use unsigned long instead of clock_t here because it's signed,
-        * but the return value from times() is basically unsigned...
-        * This is broken, but according to POSIX ;-)
-        */
        unsigned long   starttime;
        unsigned long   endtime;
        int             wraparound=0;
@@ -94,7 +116,7 @@
 
        /* Figure out when to give up.  Handle lbolt wraparound */
 
-       starttime = times(NULL);
+       starttime = our_times();
        ticks = (to_secs*hz);
        endtime = starttime + ticks;
 
@@ -111,7 +133,7 @@
        }
 
 
-       while (now = times(NULL),
+       while (now = our_times(),
                (wraparound && (now > starttime || now <= endtime))
                ||      (!wraparound && now <= endtime)) {
 
_______________________________________________________
Linux-HA-Dev: [email protected]
http://lists.linux-ha.org/mailman/listinfo/linux-ha-dev
Home Page: http://linux-ha.org/

Reply via email to