-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1
Wolfgang Dumhs wrote:
> Alan Robertson wrote:
>> Wolfgang Dumhs wrote:
>> > Hi,
>> >
>> > I'm using heartbeat 1.2.3 on Linux servers with kernel 2.6.5 on a couple
>> > of systems, and some days ago a machine again had problems after an
>> > uptime of 497 days:
>> >
>> > Jul 6 14:20:01 dmask2 heartbeat[13308]: WARN: node dmask1: is dead
>> > Jul 6 14:20:01 dmask2 heartbeat[13308]: info: Dead node dmask1 gave up resources.
>> > Jul 6 14:20:01 dmask2 heartbeat[13308]: WARN: node dmask2: is dead
>> > Jul 6 14:20:01 dmask2 heartbeat[13308]: ERROR: No local heartbeat. Forcing restart.
>> > Jul 6 14:20:01 dmask2 heartbeat[13308]: info: Heartbeat shutdown in progress. (13308)
>> > Jul 6 14:20:01 dmask2 heartbeat[13308]: WARN: node routers: is dead
>> > Jul 6 14:20:01 dmask2 heartbeat[13308]: info: Resource takeover cancelled - shutdown in progress.
>> > Jul 6 14:20:01 dmask2 heartbeat[13308]: info: Link dmask1:eth1 dead.
>> > Jul 6 14:20:01 dmask2 heartbeat[13308]: info: Link dmask1:eth0 dead.
>> > Jul 6 14:20:01 dmask2 heartbeat[13308]: info: Link routers:routers dead.
>> > Jul 6 14:20:01 dmask2 heartbeat[13308]: WARN: Late heartbeat: Node dmask2: interval 41070 ms
>> > ...
>> >
>> > I have found some posts about similar problems, but no solution, so I
>> > began to search for where the failure comes from. And I think I found it:
>> > it is a design problem of the system call times(), which is used in
>> > longclock.c. Under 32-bit Linux, system calls are not designed to return
>> > values between -4095 and -1, because this range is reserved for
>> > error codes. To check this behaviour, I patched the kernel to begin
>> > returning small negative numbers shortly after system start:
>> >
>> > #define INITIAL_JIFFIES ((unsigned long)(0x100000000*HZ/USER_HZ - 600*HZ))
>
> This doesn't work on 32-bit systems; the right statement is:
>
> #define INITIAL_JIFFIES ((unsigned long long)(0x100000000*HZ/USER_HZ - 600*HZ))
>
> and has to be set in /usr/src/linux/include/linux/time.h
> or in /usr/src/linux/include/jiffies.h, depending on kernel version.
>
>> >
>> > Then I wrote a little program which outputs the return value of times()
>> > and the errno variable, and the output is:
>> >
>> > Fri Jul 14 13:54:47 2006: times() returns: -4380, errno=0
>> > Fri Jul 14 13:54:48 2006: times() returns: -4280, errno=0
>> > Fri Jul 14 13:54:49 2006: times() returns: -4180, errno=0
>> > Fri Jul 14 13:54:50 2006: times() returns: -1, errno=4080
>> > Fri Jul 14 13:54:51 2006: times() returns: -1, errno=3980
>> > Fri Jul 14 13:54:52 2006: times() returns: -1, errno=3879
>> > Fri Jul 14 13:54:53 2006: times() returns: -1, errno=3779
>> > ...
>> > Fri Jul 14 13:55:27 2006: times() returns: -1, errno=372
>> > Fri Jul 14 13:55:28 2006: times() returns: -1, errno=272
>> > Fri Jul 14 13:55:29 2006: times() returns: -1, errno=171
>> > Fri Jul 14 13:55:30 2006: times() returns: -1, errno=71
>> > Fri Jul 14 13:55:31 2006: times() returns: 29, errno=71
>> > Fri Jul 14 13:55:32 2006: times() returns: 129, errno=71
>> > Fri Jul 14 13:55:33 2006: times() returns: 230, errno=71
>> > Fri Jul 14 13:55:34 2006: times() returns: 330, errno=71
That is _SOOO_ broken. You need to get your kernel or library fixed.
It is COMPLETELY contradictory to the times(2) man page - and any kind
of rational system behavior.
I've never heard of this.
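
For reference, the quoted output is at least consistent with the 32-bit
syscall convention Wolfgang describes: a raw result in the range -4095..-1
comes back to the caller as -1 with errno set to the negated value. The
sketch below models that and shows one possible way to undo it. It is only
an illustration under that assumption, not a tested fix for longclock.c;
the names wrapper_view, recover_ticks, safe_times, days_until_wrap and
error_window_seconds are all made up for this example.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/times.h>

/* Model (an assumption inferred from the quoted output) of how a raw
 * kernel result reaches the caller: values in [-4095, -1] are reported
 * as -1 with *err = -raw; everything else passes through unchanged and
 * *err keeps whatever value it had before. */
long wrapper_view(long raw, int *err)
{
    assert(err != NULL);
    if (raw >= -4095 && raw <= -1) {
        *err = (int)-raw;
        return -1;
    }
    return raw;
}

/* Hypothetical recovery step: if -1 came back and errno was set, rebuild
 * the tick count as -errno. The caller must have zeroed errno first. */
long recover_ticks(long ret, int err)
{
    if (ret == -1 && err != 0)
        return -(long)err;
    return ret;
}

/* Hypothetical wrapper around the real times() using the same idea.
 * Caveat: a genuine failure would be misread as a small negative tick
 * count; passing a valid buffer is meant to rule that case out. */
clock_t safe_times(void)
{
    struct tms buf;
    clock_t ret;

    errno = 0;
    ret = times(&buf);
    if (ret == (clock_t)-1 && errno != 0)
        ret = (clock_t)(-errno);
    return ret;
}

/* Days until a 32-bit tick counter running at user_hz ticks/s wraps. */
double days_until_wrap(long user_hz)
{
    return 4294967296.0 / (double)user_hz / 86400.0;
}

/* Length in seconds of the window in which ticks fall in [-4095, -1]. */
double error_window_seconds(long user_hz)
{
    return 4095.0 / (double)user_hz;
}
```

With 100 ticks per second (the step size visible in the quoted output),
days_until_wrap(100) comes to about 497.1 days, which fits the reported
uptime, and error_window_seconds(100) is 40.95 s, roughly the size of the
41070 ms late-heartbeat interval in the log.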
- --
Alan Robertson <[EMAIL PROTECTED]>
"Openness is the foundation and preservative of friendship... Let me
claim from you at all times your undisguised opinions." - William
Wilberforce
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.2 (GNU/Linux)
Comment: Using GnuPG with SUSE - http://enigmail.mozdev.org
iD8DBQFE4d6fNkLhYXF6ZA4RAml8AJ9GD4XD+1TTNXdPevSYl3HLpmS4iACfabbE
2xfhvIsHm3x0M0gilwbygsI=
=jTre
-----END PGP SIGNATURE-----
_______________________________________________________
Linux-HA-Dev: [email protected]
http://lists.linux-ha.org/mailman/listinfo/linux-ha-dev
Home Page: http://linux-ha.org/