Hi,

I would like some information about a problem with one of my
Heartbeat clusters.

Here are the logs from my cluster:


node 1:
heartbeat: 2007/07/25_02:19:45 WARN: node bdtt03: is dead
heartbeat: 2007/07/25_02:19:45 ERROR: No local heartbeat. Forcing restart.
heartbeat: 2007/07/25_02:19:45 info: Heartbeat shutdown in progress. (8577)
heartbeat: 2007/07/25_02:19:45 WARN: node bdtt04: is dead
heartbeat: 2007/07/25_02:19:45 info: Link bdtt04:eth0 dead.
heartbeat: 2007/07/25_02:19:45 info: Giving up all HA resources.
heartbeat: 2007/07/25_02:19:45 info: Link bdtt04:eth1 dead.
heartbeat: 2007/07/25_02:19:45 WARN: Late heartbeat: Node bdtt03: interval
42140 ms
heartbeat: 2007/07/25_02:19:45 info: Link bdtt04:eth0 up.
heartbeat: 2007/07/25_02:19:45 info: Link bdtt04:eth1 up.
heartbeat: 2007/07/25_02:19:46 info: Releasing resource group: bdtt03
IPaddr::160.92.24.45/24/160.92.24.255
[...]
heartbeat: 2007/07/25_02:20:27 info: Heartbeat shutdown complete.
heartbeat: 2007/07/25_02:20:27 info: Heartbeat restart triggered.
heartbeat: 2007/07/25_02:20:27 info: Restarting heartbeat.
heartbeat: 2007/07/25_02:20:27 info: Performing heartbeat restart exec.

node 2:
heartbeat: 2007/07/25_02:19:47 info: Received shutdown notice from 'bdtt03'.
heartbeat: 2007/07/25_02:19:47 info: Resources being acquired from bdtt03.
heartbeat: 2007/07/25_02:19:47 info: acquire all HA resources (standby).
heartbeat: 2007/07/25_02:19:47 info: No local resources
[/usr/lib/heartbeat/ResourceManager listkeys bdtt04] to acquire.
heartbeat: 2007/07/25_02:19:47 info: Acquiring resource group: bdtt03
IPaddr::160.92.24.45/24/160.92.24.255
heartbeat: 2007/07/25_02:19:47 info: Running /etc/ha.d/resource.d/IPaddr
160.92.24.45/24/160.92.24.255 start
heartbeat: 2007/07/25_02:19:48 info: /sbin/ifconfig eth0:0 160.92.24.45
netmask 255.255.255.0  broadcast 160.92.24.255
heartbeat: 2007/07/25_02:19:48 info: Sending Gratuitous Arp for
160.92.24.45on eth0:0 [eth0]
heartbeat: 2007/07/25_02:19:48 /usr/lib/heartbeat/send_arp -i 1010 -r 5 -p
/var/lib/heartbeat/rsctmp/send_arp/send_arp-160.92.24.45 eth0 160.92.24.45auto
160.92.24.45 ffffffffffff
heartbeat: 2007/07/25_02:19:49 info: all HA resource acquisition completed
(standby).
heartbeat: 2007/07/25_02:19:49 info: Standby resource acquisition done
[all].
heartbeat: 2007/07/25_02:19:49 info: Running /etc/ha.d/rc.d/status status
heartbeat: 2007/07/25_02:19:49 info: Taking over resource group IPaddr::
160.92.24.45/24/160.92.24.255
heartbeat: 2007/07/25_02:19:49 info: Acquiring resource group: bdtt03
IPaddr::160.92.24.45/24/160.92.24.255
heartbeat: 2007/07/25_02:19:49 info: /usr/lib/heartbeat/mach_down:
nice_failback: foreign resources acquired
heartbeat: 2007/07/25_02:19:49 info: mach_down takeover complete.
heartbeat: 2007/07/25_02:19:49 info: mach_down takeover complete for node
bdtt03.
heartbeat: 2007/07/25_02:20:26 WARN: Late heartbeat: Node bdtt03: interval
39250 ms
heartbeat: 2007/07/25_02:21:07 WARN: node bdtt03: is dead
heartbeat: 2007/07/25_02:21:07 info: Dead node bdtt03 gave up resources.
heartbeat: 2007/07/25_02:21:09 info: Link bdtt03:eth0 dead.
heartbeat: 2007/07/25_02:21:09 info: Link bdtt03:eth1 dead.
heartbeat: 2007/07/25_02:21:09 info: Heartbeat restart on node bdtt03
heartbeat: 2007/07/25_02:21:09 info: Link bdtt03:eth0 up.
heartbeat: 2007/07/25_02:21:09 info: Status update for node bdtt03: status
up
heartbeat: 2007/07/25_02:21:09 info: Link bdtt03:eth1 up.
heartbeat: 2007/07/25_02:21:09 info: Status update for node bdtt03: status
active
heartbeat: 2007/07/25_02:21:09 info: Running /etc/ha.d/rc.d/status status
heartbeat: 2007/07/25_02:21:09 info: remote resource transition completed.
heartbeat: 2007/07/25_02:21:09 info: Running /etc/ha.d/rc.d/status status
----------------
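
The "Late heartbeat" warnings in the logs above can be pulled out mechanically to see how long each gap was. The following is only a minimal sketch, not part of Heartbeat: the function name `late_heartbeats` and the `SAMPLE` text are made up for illustration, and the pattern assumes the log format shown above (including the mail client's line wrap before the interval value).

```python
import re
from datetime import datetime

# Matches a wrapped or unwrapped "Late heartbeat" warning as pasted above.
# \s+ before the number tolerates the line break added by the mail client.
LATE_RE = re.compile(
    r"heartbeat: (\d{4}/\d{2}/\d{2}_\d{2}:\d{2}:\d{2}) WARN: "
    r"Late heartbeat: Node (\S+): interval\s+(\d+) ms"
)


def late_heartbeats(log_text):
    """Return (timestamp, node, gap_in_seconds) for each late-heartbeat warning."""
    events = []
    for stamp, node, ms in LATE_RE.findall(log_text):
        when = datetime.strptime(stamp, "%Y/%m/%d_%H:%M:%S")
        events.append((when, node, int(ms) / 1000.0))
    return events


# Two warnings copied from the logs above (node 1 and node 2).
SAMPLE = """\
heartbeat: 2007/07/25_02:19:45 WARN: Late heartbeat: Node bdtt03: interval
42140 ms
heartbeat: 2007/07/25_02:20:26 WARN: Late heartbeat: Node bdtt03: interval
39250 ms
"""

if __name__ == "__main__":
    for when, node, gap in late_heartbeats(SAMPLE):
        print(when.isoformat(), node, "gap of", gap, "seconds")
```

Both warnings concern bdtt03, and both gaps are around 40 seconds, which is the stall I am trying to explain.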

I am running Linux kernel 2.6.10 (32-bit) and Heartbeat 1.2.3 (450 days of
uptime).

I don't really know what causes the "No local heartbeat" error
message.

I looked for an answer in the HA mailing list and the FAQ, but I'm not sure I
understood it correctly.

As I understand it, there are two possibilities:
- system overload or I/O overload
- kernel scheduler bug

I checked, and my system wasn't overloaded: neither CPU, I/O, nor network was
overloaded.

The problem is that I can't really believe in a scheduler bug. I use a
preemptive scheduler, and I think a process can't stay in a "sleeping" state
for more than a few seconds (and by "a few" I mean 1 or 2 seconds). Staying in
a "sleeping" state for more than 40-60 s seems very unlikely, no? I suppose a
real-time (RT) scheduler is not really needed...
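
One way to test the assumption that a process is never kept asleep much longer than it asked for is to measure it directly. This is only a rough sketch, not something Heartbeat provides: the names `measure_overshoots` and `worst_stall` are hypothetical, and a normal user-level process like this one is not scheduled the same way as the heartbeat daemon, so it can only hint at system-wide stalls.

```python
import time


def measure_overshoots(samples, interval=1.0):
    """Sleep `interval` seconds `samples` times; return how late each wake-up was."""
    overshoots = []
    for _ in range(samples):
        start = time.monotonic()
        time.sleep(interval)
        elapsed = time.monotonic() - start
        # Anything well above zero means the process was not woken on time.
        overshoots.append(elapsed - interval)
    return overshoots


def worst_stall(overshoots):
    """Return the largest wake-up delay seen, in seconds."""
    return max(overshoots)


if __name__ == "__main__":
    delays = measure_overshoots(samples=10, interval=1.0)
    print("worst wake-up delay: %.3f s" % worst_stall(delays))
```

Left running over a long period, a delay of tens of seconds here would point toward the whole system stalling rather than a problem inside Heartbeat itself.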

Heartbeat runs on a dual-Xeon, dual-core machine and the CPUs are always
80-90% idle, so the system should always be able to give CPU time to the
heartbeat process.

So what now? I'm totally lost and I don't have any idea or clue how to solve
this problem. Any ideas?
bdtt03 IPaddr::160.92.24.45/24/160.92.24.255

Attachment: ha.cf
Description: Binary data

_______________________________________________
Linux-HA mailing list
[email protected]
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems
