Hi,
I have a two-node system that had been running heartbeat (Linux HA
2.1.4) for a while. One of the heartbeat processes went to 100% CPU
usage and stayed there, logging the following:
heartbeat[17464]: 2010/11/21_03:04:07 info: Gmain_timeout_dispatch: started at 3846010832 should have started at 3845570140
heartbeat[17464]: 2010/11/21_03:04:08 WARN: Gmain_timeout_dispatch: Dispatch function for retransmit request took too long to execute: 400 ms (> 10 ms) (GSource: 0x18254030)
I tried to shut down using /etc/init.d/heartbeat stop, but the shutdown
hung. Ever since then, the only way to stop the heartbeat processes is
to kill them (kill or killall).
When heartbeat is started now, only the first few processes come up and
heartbeat never fully initializes. The following processes never come
up:
/usr/lib/heartbeat/ccm
/usr/lib/heartbeat/cib
/usr/lib/heartbeat/lrmd -r
/usr/lib/heartbeat/stonithd
/usr/lib/heartbeat/attrd
/usr/lib/heartbeat/crmd
/usr/lib/heartbeat/mgmtd -v
/usr/lib/heartbeat/cibmon -d
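
The list above could be checked after each start attempt with something
like the following minimal sketch. It is not part of heartbeat: the
function names (missing_daemons, current_ps_output) are made up for
illustration, and it assumes a procps-style "ps -eo args" and that each
daemon shows up in ps with exactly the command line listed above.

```python
import subprocess

# Command lines listed above as never coming up.
EXPECTED = [
    "/usr/lib/heartbeat/ccm",
    "/usr/lib/heartbeat/cib",
    "/usr/lib/heartbeat/lrmd -r",
    "/usr/lib/heartbeat/stonithd",
    "/usr/lib/heartbeat/attrd",
    "/usr/lib/heartbeat/crmd",
    "/usr/lib/heartbeat/mgmtd -v",
    "/usr/lib/heartbeat/cibmon -d",
]


def missing_daemons(ps_output, expected=EXPECTED):
    """Return the expected command lines that have no matching ps line.

    A ps line matches if it equals the expected command line or starts
    with it followed by a space, so "cib" is not confused with "cibmon".
    """
    running = [line.strip() for line in ps_output.splitlines()]
    return [
        cmd
        for cmd in expected
        if not any(line == cmd or line.startswith(cmd + " ") for line in running)
    ]


def current_ps_output():
    """Capture every process's command line (assumes procps `ps -eo args`)."""
    result = subprocess.run(
        ["ps", "-eo", "args"], capture_output=True, text=True, check=True
    )
    return result.stdout


if __name__ == "__main__":
    for cmd in missing_daemons(current_ps_output()):
        print("not running:", cmd)
```

Running it a little while after /etc/init.d/heartbeat start would print
one "not running:" line for each daemon from the list that is absent.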
These logs are now seen every time a start is attempted:
heartbeat[12339]: 2010/12/08_16:20:23 ERROR: Message hist queue is filling up (500 messages in queue)
heartbeat[12339]: 2010/12/08_16:20:23 ERROR: Message hist queue is filling up (500 messages in queue)
heartbeat[12339]: 2010/12/08_16:20:23 ERROR: Message hist queue is filling up (500 messages in queue)
So heartbeat is now in a state where it will not start all of its
processes, and it hangs when I try to stop it. I'm not sure what else
to look at. Has anyone seen this kind of behavior before?
Thanks,
Bart