As you guys may recall, I have set up a heartbeat/drbd-based system to replace
an aging drbd solution.

The new system is in place but has not been activated yet.

Some self-checking scripts alerted me that heartbeat had died on one of the
machines.

Looking through the logs, I found this in ha-log.2:

Dec 13 17:13:14 pfs-srv3 heartbeat: [1243]: WARN: Managed HBREAD process 3279 killed by signal 24 [SIGXCPU - CPU limit exceeded].
Dec 13 17:13:14 pfs-srv3 heartbeat: [1243]: ERROR: Managed HBREAD process 3279 dumped core
Dec 13 17:13:14 pfs-srv3 heartbeat: [1243]: ERROR: HBREAD process died. Beginning communications restart process for comm channel 0.
Dec 13 17:13:14 pfs-srv3 heartbeat: [1243]: info: glib: UDP Broadcast heartbeat closed on port 12694 interface eth1 - Status: 1
Dec 13 17:13:14 pfs-srv3 heartbeat: [1243]: WARN: Managed HBWRITE process 3278 killed by signal 9 [SIGKILL - Kill, unblockable].
Dec 13 17:13:14 pfs-srv3 heartbeat: [1243]: ERROR: Both comm processes for channel 0 have died.  Restarting.
Dec 13 17:13:14 pfs-srv3 heartbeat: [1243]: info: glib: UDP Broadcast heartbeat started on port 12694 (12694) interface eth1
Dec 13 17:13:14 pfs-srv3 heartbeat: [1243]: info: glib: UDP Broadcast heartbeat closed on port 12694 interface eth1 - Status: 1
Dec 13 17:13:14 pfs-srv3 heartbeat: [1243]: info: Communications restart succeeded.
Dec 16 10:29:38 pfs-srv3 heartbeat: [1269]: CRIT: Emergency Shutdown: Master Control process died.
Dec 16 10:29:38 pfs-srv3 heartbeat: [1269]: CRIT: Killing pid 1243 with SIGTERM
Dec 16 10:29:38 pfs-srv3 heartbeat: [1269]: CRIT: Killing pid 7247 with SIGTERM
Dec 16 10:29:38 pfs-srv3 heartbeat: [1269]: CRIT: Killing pid 7248 with SIGTERM
Dec 16 10:29:38 pfs-srv3 heartbeat: [1269]: CRIT: Emergency Shutdown(MCP dead): Killing ourselves.

It looks like heartbeat ran into two separate problems: on Dec 13 its HBREAD
child process was killed by SIGXCPU (CPU limit exceeded), and on Dec 16 it did
an emergency shutdown because the master control process died. Any ideas as to
why this could have happened?
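For context, the kernel sends SIGXCPU when a process exceeds its soft CPU-time limit (RLIMIT_CPU). The log does not show whether such a limit was set on heartbeat's children, so one thing worth checking is the limit actually in effect. Here is a minimal Python sketch, assuming a Unix system; querying another pid relies on `resource.prlimit`, which is Linux-only (Python 3.4+) and may need root. The helper names are hypothetical.

```python
import resource
import signal


def signal_name(num: int) -> str:
    """Map a raw signal number from the log (e.g. 24) to its symbolic name on this platform."""
    return signal.Signals(num).name


def format_limit(value: int) -> str:
    """Render an rlimit value; RLIM_INFINITY means no limit is set."""
    return "unlimited" if value == resource.RLIM_INFINITY else f"{value} s"


def cpu_time_limits(pid: int = 0) -> tuple:
    """Return the (soft, hard) RLIMIT_CPU for pid, or for this process when pid is 0."""
    if pid == 0:
        soft, hard = resource.getrlimit(resource.RLIMIT_CPU)
    else:
        # Linux-only; reading another user's process typically requires privileges.
        soft, hard = resource.prlimit(pid, resource.RLIMIT_CPU)
    return format_limit(soft), format_limit(hard)
```

On x86 Linux, `signal_name(24)` gives "SIGXCPU", which matches the log. Running `cpu_time_limits(<heartbeat pid>)` as root shows whether a finite CPU-time limit applies to a running heartbeat process.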
_______________________________________________
Linux-HA mailing list
[email protected]
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems