Hi,

I have a two-node system that had been running heartbeat (Linux-HA 
2.1.4) for a while.  One of the heartbeat processes went to 100% CPU 
usage and stayed there, logging the following:

heartbeat[17464]: 2010/11/21_03:04:07 info: Gmain_timeout_dispatch: started at 3846010832 should have started at 3845570140
heartbeat[17464]: 2010/11/21_03:04:08 WARN: Gmain_timeout_dispatch: Dispatch function for retransmit request took too long to execute: 400 ms (> 10 ms) (GSource: 0x18254030)

I tried to shut it down using /etc/init.d/heartbeat stop, but the 
shutdown hung, and ever since then the only way to stop the heartbeat 
processes has been to kill them (kill or killall).

When heartbeat is started, only the first few processes come up and 
heartbeat never fully initializes.  The following processes never come 
up:

    /usr/lib/heartbeat/ccm
    /usr/lib/heartbeat/cib
    /usr/lib/heartbeat/lrmd -r
    /usr/lib/heartbeat/stonithd
    /usr/lib/heartbeat/attrd
    /usr/lib/heartbeat/crmd
    /usr/lib/heartbeat/mgmtd -v
    /usr/lib/heartbeat/cibmon -d
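
For anyone wanting to repeat this check on their own node, here is a minimal sketch of how the missing daemons could be listed. It assumes a Linux `ps` that accepts `-eo args`; the names `EXPECTED`, `missing_daemons` and `running_cmdlines` are made up for illustration and are not part of heartbeat.

```python
import subprocess

# Daemon paths taken from the list above (arguments dropped for matching).
EXPECTED = [
    "/usr/lib/heartbeat/ccm",
    "/usr/lib/heartbeat/cib",
    "/usr/lib/heartbeat/lrmd",
    "/usr/lib/heartbeat/stonithd",
    "/usr/lib/heartbeat/attrd",
    "/usr/lib/heartbeat/crmd",
    "/usr/lib/heartbeat/mgmtd",
    "/usr/lib/heartbeat/cibmon",
]


def missing_daemons(cmdlines, expected=EXPECTED):
    """Return the expected paths that are not the first word of any command line."""
    missing = []
    for path in expected:
        # Compare only the executable (first token), so "cib" does not match "cibmon".
        if not any(cmd.split()[:1] == [path] for cmd in cmdlines):
            missing.append(path)
    return missing


def running_cmdlines():
    """Return the command line of every running process (assumes procps-style ps)."""
    out = subprocess.run(
        ["ps", "-eo", "args"], capture_output=True, text=True, check=True
    ).stdout
    return out.splitlines()[1:]  # drop the header line
```

Calling `missing_daemons(running_cmdlines())` right after a start attempt should return the daemons that did not come up.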

These logs are now seen every time a start is attempted:

heartbeat[12339]: 2010/12/08_16:20:23 ERROR: Message hist queue is filling up (500 messages in queue)
heartbeat[12339]: 2010/12/08_16:20:23 ERROR: Message hist queue is filling up (500 messages in queue)
heartbeat[12339]: 2010/12/08_16:20:23 ERROR: Message hist queue is filling up (500 messages in queue)

So heartbeat is now in a state where it will not start all of its 
processes, and it hangs when I try to stop it.  I'm not sure what else 
to look at.  Has anyone seen this kind of behavior before?

Thanks,
Bart
