Wolfgang Dumhs wrote:
> Alan Robertson wrote:
>> Wolfgang Dumhs wrote:
>>> Alan Robertson wrote:
>>>> Wolfgang Dumhs wrote:
>>>>> Hi,
>>>>>
>>>>> im using heartbeat 1.2.3 on linux servers with kernel 2.6.5 on a
>>>>> couple
>>>>> of systems and some days ago again a machine had problems after an
>>>>> uptime of 497 days:
>>>>>
>>>>> Jul 6 14:20:01 dmask2 heartbeat[13308]: WARN: node dmask1: is dead
>>>>> Jul 6 14:20:01 dmask2 heartbeat[13308]: info: Dead node dmask1 gave
>>>> up resources.
>>>>> Jul 6 14:20:01 dmask2 heartbeat[13308]: WARN: node dmask2: is dead
>>>>> Jul 6 14:20:01 dmask2 heartbeat[13308]: ERROR: No local heartbeat.
>>>> Forcing restart.
>>>>> Jul 6 14:20:01 dmask2 heartbeat[13308]: info: Heartbeat shutdown in
>>>> progress. (13308)
>>>>> Jul 6 14:20:01 dmask2 heartbeat[13308]: WARN: node routers: is dead
>>>>> Jul 6 14:20:01 dmask2 heartbeat[13308]: info: Resource takeover
>>>> cancelled - shutdown in progress.
>>>>> Jul 6 14:20:01 dmask2 heartbeat[13308]: info: Link dmask1:eth1 dead.
>>>>> Jul 6 14:20:01 dmask2 heartbeat[13308]: info: Link dmask1:eth0 dead.
>>>>> Jul 6 14:20:01 dmask2 heartbeat[13308]: info: Link routers:routers
>>>> dead.
>>>>> Jul 6 14:20:01 dmask2 heartbeat[13308]: WARN: Late heartbeat: Node
>>>> dmask2: interval 41070 ms
>>>>> ...
>>>>>
>>>>> I have found some posts with similar problems, but no solution, so I
>>>>> began to search, where the failure comes from. And I think I found it:
>>>>> It is a design problem of the system call times(), which ist used in
>>>>> longclock.c. Under 32 bit linux, system calls are not designed to
>>>> return
>>>>> values between -4095 and -1, because this range is reserved for
>>>>> error-codes. To check this behaviour, I patched the kernel to begin
>>>>> returning small negativ numbers after system start:
>>>>>
>>>>> #define INITIAL_JIFFIES ((unsigned long)(0x100000000*HZ/USER_HZ -
>>>> 600*HZ))
>>> This doesn´t work an 32 bit systems, the rigth statement is:
>>>
>>> #define INITIAL_JIFFIES ((unsigned long long)(0x100000000*HZ/USER_HZ -
>>> 600*HZ))
>>>
>>> and has to be set in /usr/src/linux/include/linux/time.h
>>> or in /usr/src/linux/include/jiffies.h, depending on kernel version.
>>>
>>>>> Then I wrote a little program which ouputs the return of times(), and
>>>>> the errno-Variable, and the ouput is:
>>>>>
>>>>> Fri Jul 14 13:54:47 2006: times() returns: -4380, errno=0
>>>>> Fri Jul 14 13:54:48 2006: times() returns: -4280, errno=0
>>>>> Fri Jul 14 13:54:49 2006: times() returns: -4180, errno=0
>>>>> Fri Jul 14 13:54:50 2006: times() returns: -1, errno=4080
>>>>> Fri Jul 14 13:54:51 2006: times() returns: -1, errno=3980
>>>>> Fri Jul 14 13:54:52 2006: times() returns: -1, errno=3879
>>>>> Fri Jul 14 13:54:53 2006: times() returns: -1, errno=3779
>>>>> ...
>>>>> Fri Jul 14 13:55:27 2006: times() returns: -1, errno=372
>>>>> Fri Jul 14 13:55:28 2006: times() returns: -1, errno=272
>>>>> Fri Jul 14 13:55:29 2006: times() returns: -1, errno=171
>>>>> Fri Jul 14 13:55:30 2006: times() returns: -1, errno=71
>>>>> Fri Jul 14 13:55:31 2006: times() returns: 29, errno=71
>>>>> Fri Jul 14 13:55:32 2006: times() returns: 129, errno=71
>>>>> Fri Jul 14 13:55:33 2006: times() returns: 230, errno=71
>>>>> Fri Jul 14 13:55:34 2006: times() returns: 330, errno=71
>>
>>
>> That is _SOOO_ broken. You need to get your kernel or library fixed.
>> It is COMPLETELY contradictory to the times(2) man page - and any kind
>> of rational system behavior.
>>
>> I've never heard of this.
>
> I agree that this is broken, and it seems to be a libc-bug. I have tried
> with
> SuSE Linux 9.1 / Kernel 2.6.5 / glibc 2.3.3 and SuSE Linux 9.3 / Kernel
> 2.6.13
> / glibc 2.3.4 and the behaviour is the same.
>
> The reason for this behaviour seems to be how return values of system calls
> are treeted by the library: small negativ numbers - in the range -1 to
> -4096
> - are supposed to be error-codes and thus are returned as -1 while the
> errno-variable is set to the negative return value. One can see this
> overall in
> the kernel where system calls do something like "return -EINVAL" or
> "return -ENOSYS":
> The wrapper for system calls in the library translates this to a return
> value of -1
> and sets the errno-variable to EINVAL or ENOSYS.
>
> And the same happens to small negativ numbers during the system call
> times(2), which
> seems to have the standard system call wrapper in the library.
>
> As I don´t know how one could fix that in the library and even if it
> were fixed
> not all people using heartbeat would update their glibc, I thought it
> would be a
> good idea to bypass this bug in the heartbeat-code with the patch I
> included in
> my former post.
OK.
There are several places in the code where we use times(2). I've taken
your ideas and created a new patch for the 2.x CVS(HEAD) branch which I
think covers them all.
I've attached it here.
Wolfgang: Could you test this and see if it works during wraparound?
It's not quite the same as yours, but avoids casting -1 to an unsigned.
I think the effect should be the same.
Please let me/us all know.
--
Alan Robertson <[EMAIL PROTECTED]>
"Openness is the foundation and preservative of friendship... Let me
claim from you at all times your undisguised opinions." - William
Wilberforce
Index: include/clplumbing/longclock.h
===================================================================
RCS file: /home/cvs/linux-ha/linux-ha/include/clplumbing/longclock.h,v
retrieving revision 1.9
diff -u -r1.9 longclock.h
--- include/clplumbing/longclock.h 22 Jun 2005 14:01:34 -0000 1.9
+++ include/clplumbing/longclock.h 15 Aug 2006 21:35:46 -0000
@@ -39,6 +39,10 @@
*
* The functions provided here are:
*
+ * unsigned long cl_times(void);
+ * A rational wrapper for the times(2) call
+ * for those cases where only the return value
+ * is wanted.
* longclock_t time_longclock(void);
* Returns current time as a longclock_t.
*
@@ -77,6 +81,7 @@
*
* extern const longclock_t zero_longclock;
*/
+extern unsigned long cl_times(void);
#ifdef CLOCK_T_IS_LONG_ENOUGH
# ifndef HAVE_LONGCLOCK_ARITHMETIC
Index: lib/clplumbing/GSource.c
===================================================================
RCS file: /home/cvs/linux-ha/linux-ha/lib/clplumbing/GSource.c,v
retrieving revision 1.84
diff -u -r1.84 GSource.c
--- lib/clplumbing/GSource.c 17 Jul 2006 16:56:47 -0000 1.84
+++ lib/clplumbing/GSource.c 15 Aug 2006 21:35:46 -0000
@@ -970,7 +970,6 @@
*timeoutms = 1000; /* Sigh... */
if (sig_src->signal_triggered) {
- static struct tms dummy_tms_struct;
clock_t now;
clock_t diff;
@@ -980,7 +979,7 @@
return TRUE;
}
/* Otherwise, this is when it was first detected */
- now = times(&dummy_tms_struct);
+ now = cl_times();
diff = now - sig_src->sh_detecttime; /* How long since
signal occurred? */
lc_store(
sig_src->detecttime,
@@ -1004,14 +1003,13 @@
g_assert(IS_SIGSOURCE(sig_src));
if (sig_src->signal_triggered) {
- static struct tms dummy_tms_struct;
clock_t now;
clock_t diff;
if (cmp_longclock(lc_fetch(sig_src->detecttime),
zero_longclock) != 0){
return TRUE;
}
/* Otherwise, this is when it was first detected */
- now = times(&dummy_tms_struct);
+ now = cl_times();
diff = now - sig_src->sh_detecttime;
lc_store(
sig_src->detecttime,
@@ -1072,7 +1070,6 @@
static void
G_main_signal_handler(int nsig)
{
- static struct tms dummy_tms_struct;
GSIGSource* sig_src = NULL;
if(G_main_signal_list == NULL) {
@@ -1095,7 +1092,7 @@
/* Time from first occurance of signal */
if (!sig_src->signal_triggered) {
/* Avoid calling longclock_time() on a signal */
- sig_src->sh_detecttime=times(&dummy_tms_struct);
+ sig_src->sh_detecttime=cl_times();
}
sig_src->signal_triggered = TRUE;
}
Index: lib/clplumbing/cl_random.c
===================================================================
RCS file: /home/cvs/linux-ha/linux-ha/lib/clplumbing/cl_random.c,v
retrieving revision 1.4
diff -u -r1.4 cl_random.c
--- lib/clplumbing/cl_random.c 15 Aug 2006 17:54:46 -0000 1.4
+++ lib/clplumbing/cl_random.c 15 Aug 2006 21:35:46 -0000
@@ -136,7 +136,6 @@
char buf[16];
FILE* fs;
struct timeval tv;
- static struct tms dummy_tms_struct;
const char * randdevname [] = {"/dev/urandom", "/dev/random"};
int idev;
#if 0
@@ -217,14 +216,8 @@
/*
* times(2) returns the number of clock ticks since
* boot. Fairly predictable, but not completely so...
- *
- * The man page claims times(2) can fail, but the
- * correct return result also might include -1 so
- * in practice it's impossible to tell, and since
- * dummy_tms_struct can't be invalid there is no
- * known reason why it should fail.
*/
- return (unsigned int) times(&dummy_tms_struct);
+ return (unsigned int) cl_times();
#if 0
Index: lib/clplumbing/longclock.c
===================================================================
RCS file: /home/cvs/linux-ha/linux-ha/lib/clplumbing/longclock.c,v
retrieving revision 1.20
diff -u -r1.20 longclock.c
--- lib/clplumbing/longclock.c 6 Feb 2006 13:53:53 -0000 1.20
+++ lib/clplumbing/longclock.c 15 Aug 2006 21:35:46 -0000
@@ -24,6 +24,7 @@
#include <portability.h>
#include <unistd.h>
#include <sys/times.h>
+#include <errno.h>
#include <clplumbing/longclock.h>
#include <clplumbing/cl_log.h>
@@ -61,14 +62,59 @@
return Hz;
}
-static struct tms longclock_dummy_tms_struct;
+#ifdef TIMES_ALLOWS_NULL_PARAM
+# define TIMES_PARAM NULL
+#else
+ static struct tms dummy_longclock_tms_struct;
+# define TIMES_PARAM &dummy_longclock_tms_struct
+#endif
+
+unsigned long
+cl_times(void) /* Make times(2) behave rationally on Linux */
+{
+ int save_errno = errno;
+ clock_t ret;
+
+ /*
+ * times(2) really returns an unsigned value ...
+ *
+ * We don't check to see if we got back the error value (-1), because
+ * the only possibility for an error would be if the address of
+ * longclock_dummy_tms_struct was invalid. Since it's a
+ * compiler-generated address, we assume that errors are impossible.
+ * And, unfortunately, it is quite possible for the correct return
+ * from times(2) to be exactly (clock_t)-1. Sigh...
+ *
+ */
+ errno = 0;
+ ret = times(TIMES_PARAM);
+
+/*
+ * This is to work around a bug in the system call interface
+ * for times(2) found in glibc on Linux (and maybe elsewhere)
+ * It changes the return values from -1 to -4096 all into
+ * -1 and then dumps the -(return value) into errno.
+ *
+ * This totally bizarre behavior seems to be widespread in
+ * versions of Linux and glibc.
+ *
+ * Many thanks to Wolfgang Dumhs <wolfgang.dumhs (at) gmx.at>
+ * for finding and documenting this bizarre behavior.
+ */
+ if (errno != 0) {
+ ret = (clock_t) (-errno);
+ }
+ errno = save_errno;
+ return (unsigned long)ret;
+}
+
#ifdef CLOCK_T_IS_LONG_ENOUGH
longclock_t
time_longclock(void)
{
/* See note below about deliberately ignoring errors... */
- return (longclock_t)times(&longclock_dummy_tms_struct);
+ return (longclock_t)cl_times();
}
#else /* clock_t is shorter than 64 bits */
@@ -89,18 +135,8 @@
unsigned long timesval;
++callcount;
- /*
- * times(2) really returns an unsigned value ...
- *
- * We don't check to see if we got back the error value (-1), because
- * the only possibility for an error would be if the address of
- * longclock_dummy_tms_struct was invalid. Since it's a
- * compiler-generated address, we assume that errors are impossible.
- * And, unfortunately, it is quite possible for the correct return
- * from times(2) to be exactly (clock_t)-1. Sigh...
- *
- */
- timesval = (unsigned long) times(&longclock_dummy_tms_struct);
+
+ timesval = (unsigned long) cl_times();
if (calledbefore && timesval < lasttimes) {
clock_t jumpbackby = lasttimes - timesval;
Index: lib/stonith/expect.c
===================================================================
RCS file: /home/cvs/linux-ha/linux-ha/lib/stonith/expect.c,v
retrieving revision 1.21
diff -u -r1.21 expect.c
--- lib/stonith/expect.c 11 May 2005 11:49:20 -0000 1.21
+++ lib/stonith/expect.c 15 Aug 2006 21:35:47 -0000
@@ -44,6 +44,7 @@
#include <netinet/in.h>
#include <arpa/inet.h>
#include <stonith/st_ttylock.h>
+#include <clplumbing/longclock.h>
#define ENABLE_PIL_DEFS_PRIVATE
#include <pils/plugin.h>
@@ -65,6 +66,32 @@
#define STRDUP StonithPIsys->imports->mstrdup
#define FREE(p) {StonithPIsys->imports->mfree(p); (p) = NULL;}
+#ifdef TIMES_ALLOWS_NULL_PARAM
+# define TIMES_PARAM NULL
+#else
+ static struct tms dummy_longclock_tms_struct;
+# define TIMES_PARAM &dummy_longclock_tms_struct
+#endif
+
+static unsigned long
+our_times(void) /* Make times(2) behave rationally on Linux */
+{
+ int save_errno;
+ clock_t ret;
+
+ /*
+ * This code copied from clplumbing/longclock.c to avoid
+ * making STONITH depend on clplumbing. See it for an explanation
+ */
+ errno = 0;
+ ret = times(TIMES_PARAM);
+
+ if (errno != 0) {
+ ret = (clock_t) (-errno);
+ }
+ errno = save_errno;
+ return (unsigned long)ret;
+}
/*
* Look for ('expect') any of a series of tokens in the input
* Return the token type for the given token or -1 on error.
@@ -74,11 +101,6 @@
ExpectToken(int fd, struct Etoken * toklist, int to_secs, char * savebuf
, int maxline, int Debug)
{
- /*
- * We use unsigned long instead of clock_t here because it's signed,
- * but the return value from times() is basically unsigned...
- * This is broken, but according to POSIX ;-)
- */
unsigned long starttime;
unsigned long endtime;
int wraparound=0;
@@ -94,7 +116,7 @@
/* Figure out when to give up. Handle lbolt wraparound */
- starttime = times(NULL);
+ starttime = our_times();
ticks = (to_secs*hz);
endtime = starttime + ticks;
@@ -111,7 +133,7 @@
}
- while (now = times(NULL),
+ while (now = our_times(),
(wraparound && (now > starttime || now <= endtime))
|| (!wraparound && now <= endtime)) {
_______________________________________________________
Linux-HA-Dev: [email protected]
http://lists.linux-ha.org/mailman/listinfo/linux-ha-dev
Home Page: http://linux-ha.org/