Alan Robertson wrote:
Wolfgang Dumhs wrote:
Alan Robertson wrote:
Wolfgang Dumhs wrote:
Hi,

I'm using heartbeat 1.2.3 on Linux servers with kernel 2.6.5 on a couple
of systems, and a few days ago a machine again had problems after an
uptime of 497 days:

Jul  6 14:20:01 dmask2 heartbeat[13308]: WARN: node dmask1: is dead
Jul  6 14:20:01 dmask2 heartbeat[13308]: info: Dead node dmask1 gave
up resources.
Jul  6 14:20:01 dmask2 heartbeat[13308]: WARN: node dmask2: is dead
Jul  6 14:20:01 dmask2 heartbeat[13308]: ERROR: No local heartbeat.
Forcing restart.
Jul  6 14:20:01 dmask2 heartbeat[13308]: info: Heartbeat shutdown in
progress. (13308)
Jul  6 14:20:01 dmask2 heartbeat[13308]: WARN: node routers: is dead
Jul  6 14:20:01 dmask2 heartbeat[13308]: info: Resource takeover
cancelled - shutdown in progress.
Jul  6 14:20:01 dmask2 heartbeat[13308]: info: Link dmask1:eth1 dead.
Jul  6 14:20:01 dmask2 heartbeat[13308]: info: Link dmask1:eth0 dead.
Jul  6 14:20:01 dmask2 heartbeat[13308]: info: Link routers:routers
dead.
Jul  6 14:20:01 dmask2 heartbeat[13308]: WARN: Late heartbeat: Node
dmask2: interval 41070 ms
...

I have found some posts describing similar problems, but no solution, so I
began to search for where the failure comes from. And I think I found it:
it is a design problem of the system call times(), which is used in
longclock.c. Under 32-bit Linux, system calls are not designed to return
values between -4095 and -1, because this range is reserved for
error codes. To check this behaviour, I patched the kernel so that times()
begins returning small negative numbers shortly after system start:

#define INITIAL_JIFFIES ((unsigned long)(0x100000000*HZ/USER_HZ -
600*HZ))
This doesn't work on 32-bit systems; the right statement is:

#define INITIAL_JIFFIES ((unsigned long long)(0x100000000*HZ/USER_HZ -
600*HZ))

and has to be set in /usr/src/linux/include/linux/time.h
or in /usr/src/linux/include/jiffies.h, depending on kernel version.

Then I wrote a little program which outputs the return value of times()
and the errno variable, and the output is:

Fri Jul 14 13:54:47 2006: times() returns: -4380, errno=0
Fri Jul 14 13:54:48 2006: times() returns: -4280, errno=0
Fri Jul 14 13:54:49 2006: times() returns: -4180, errno=0
Fri Jul 14 13:54:50 2006: times() returns: -1, errno=4080
Fri Jul 14 13:54:51 2006: times() returns: -1, errno=3980
Fri Jul 14 13:54:52 2006: times() returns: -1, errno=3879
Fri Jul 14 13:54:53 2006: times() returns: -1, errno=3779
...
Fri Jul 14 13:55:27 2006: times() returns: -1, errno=372
Fri Jul 14 13:55:28 2006: times() returns: -1, errno=272
Fri Jul 14 13:55:29 2006: times() returns: -1, errno=171
Fri Jul 14 13:55:30 2006: times() returns: -1, errno=71
Fri Jul 14 13:55:31 2006: times() returns: 29, errno=71
Fri Jul 14 13:55:32 2006: times() returns: 129, errno=71
Fri Jul 14 13:55:33 2006: times() returns: 230, errno=71
Fri Jul 14 13:55:34 2006: times() returns: 330, errno=71


That is _SOOO_ broken.  You need to get your kernel or library fixed.
It is COMPLETELY contradictory to the times(2) man page - and any kind
of rational system behavior.

I've never heard of this.

I agree that this is broken, and it seems to be a libc-bug. I have tried with
SuSE Linux 9.1 / Kernel 2.6.5 / glibc 2.3.3 and SuSE Linux 9.3 / Kernel 2.6.13
/ glibc 2.3.4 and the behaviour is the same.

The reason for this behaviour seems to be how return values of system calls
are treated by the library: small negative numbers - in the range -1 to -4095
- are assumed to be error codes and thus are returned as -1, while the
errno variable is set to the negated return value. One can see this throughout
the kernel, where system calls do something like "return -EINVAL" or
"return -ENOSYS": the wrapper for system calls in the library translates this
to a return value of -1 and sets the errno variable to EINVAL or ENOSYS.

And the same happens to small negative numbers returned by the system call
times(2), which seems to use the standard system call wrapper in the library.

As I don't know how one could fix that in the library, and since even if it
were fixed not all people using heartbeat would update their glibc, I thought
it would be a good idea to work around this bug in the heartbeat code with the
patch I included in my previous post.

--
Wolfgang Dumhs
_______________________________________________________
Linux-HA-Dev: [email protected]
http://lists.linux-ha.org/mailman/listinfo/linux-ha-dev
Home Page: http://linux-ha.org/
