Hi there,

We have a 12-node cluster for managing KVM-based virtual machines.
The nodes run Fedora 12 with Pacemaker
(pacemaker-1.0.7-1.fc12.x86_64) and Heartbeat
(heartbeat-3.0.0-0.7.0daab7da36a8.hg.fc12.x86_64).

Heartbeat crashed on one node with SIGXCPU:

Jan  2 01:21:11 node09 heartbeat: [31328]: WARN: Managed HBREAD process 25702 killed by signal 24 [SIGXCPU - CPU limit exceeded].
Jan  2 01:21:11 node09 heartbeat: [31328]: ERROR: Managed HBREAD process 25702 dumped core
Jan  2 01:21:11 node09 heartbeat: [31328]: ERROR: HBREAD process died. Beginning communications restart process for comm channel 0.
Jan  2 01:21:11 node09 heartbeat: [31328]: WARN: Managed HBWRITE process 25701 killed by signal 9 [SIGKILL - Kill, unblockable].
Jan  2 01:21:11 node09 heartbeat: [31328]: ERROR: Both comm processes for channel 0 have died.  Restarting.
Jan  2 01:21:11 node09 heartbeat: [31328]: info: glib: UDP multicast heartbeat started for group 239.0.0.4 port 694 interface br_vlan1040 (ttl=1 loop=0)
Jan  2 01:21:11 node09 heartbeat: [31328]: info: Communications restart succeeded.
Jan  2 01:21:12 node09 heartbeat: [22135]: info: Stack hogger failed 0xffffffff
Jan  2 01:21:12 node09 heartbeat: [22136]: info: Stack hogger failed 0xffffffff
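
For reference, the same kind of termination can be reproduced outside
of heartbeat. The following is only a minimal sketch, assuming a Linux
shell whose ulimit builtin supports -S and -t; the function name
run_with_cpu_limit is made up for this illustration and has nothing to
do with heartbeat itself. It runs a busy loop under a soft CPU time
limit and prints the exit status of the killed subshell.

```shell
# Hypothetical helper: burn CPU in a subshell under a soft CPU time limit
# (in seconds) and print the subshell's exit status afterwards.
run_with_cpu_limit() {
    # -S sets only the soft limit, so the kernel delivers SIGXCPU
    # rather than the SIGKILL used once the hard limit is reached.
    ( ulimit -S -t "$1"; while :; do :; done )
    echo $?
}

status=$(run_with_cpu_limit 1)
echo "subshell exit status: $status"
```

A shell reports a child killed by a signal as 128 plus the signal
number, so on a system where SIGXCPU is signal 24 (as in the log
above) the status should be 152.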

We figured out that when debug mode is turned on, heartbeat sets a max
CPU time limit of 4143 seconds (you can see it with
cat /proc/<heartbeat-pid>/limits). When debug mode is turned off, that
limit is not set.
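
The check can be scripted. This is a rough sketch, assuming a Linux
/proc filesystem where the limits file has a "Max cpu time" row with
the soft limit in the fourth column and the hard limit in the fifth;
the function names are invented for this example.

```shell
# Hypothetical helpers: print the soft / hard "Max cpu time" limit of a process.
# $1 is a PID (or "self"); the output is a number of seconds or "unlimited".
cpu_soft_limit() {
    awk '/^Max cpu time/ { print $4 }' "/proc/$1/limits"
}

cpu_hard_limit() {
    awk '/^Max cpu time/ { print $5 }' "/proc/$1/limits"
}

# Inspect the current shell; for heartbeat, pass its PID instead
# (for example the master process 31328 from the log above).
echo "soft: $(cpu_soft_limit $$)  hard: $(cpu_hard_limit $$)"
```

Running it against each heartbeat process with debug mode on and off
should show whether the limit applies to the master process, the
HBREAD/HBWRITE children, or both.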

Directly after the heartbeat crash, the pacemaker ping RA stopped
working. It now produces only syntax errors:

Jan  2 01:21:24 node09 lrmd: [31341]: info: RA output: (pingd_stornet:8:monitor:stderr) expr: syntax error
Jan  2 01:21:24 node09 attrd_updater: [22148]: info: Invoked: attrd_updater -n pingd_stornet -v -d 5s
Jan  2 01:21:24 node09 attrd_updater: [22148]: info: attrd_lazy_update: Connecting to cluster... 5 retries remaining
Jan  2 01:21:38 node09 lrmd: [31341]: info: RA output: (pingd_stornet:8:monitor:stderr) expr: syntax error
Jan  2 01:21:38 node09 attrd_updater: [22172]: info: Invoked: attrd_updater -n pingd_stornet -v -d 5s
Jan  2 01:21:38 node09 attrd_updater: [22172]: info: attrd_lazy_update: Connecting to cluster... 5 retries remaining
Jan  2 01:21:52 node09 lrmd: [31341]: info: RA output: (pingd_stornet:8:monitor:stderr) expr: syntax error
Jan  2 01:21:52 node09 attrd_updater: [22191]: info: Invoked: attrd_updater -n pingd_stornet -v -d 5s

On every machine that had the SIGXCPU crash, the ping RA no longer
works.
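
One observation: in the log, attrd_updater is invoked with -v followed
directly by -d, so the value appears to be empty. That would be
consistent with an empty shell variable reaching expr, although this
is a guess and not something verified against the RA script. The
sketch below only shows how expr behaves in that situation; the names
score_for, guarded_score_for, active and multiplier are made up for
the illustration and are not taken from the ping RA.

```shell
# Hypothetical stand-in for a score calculation done with expr.
score_for() {
    active="$1"
    multiplier="$2"
    # Unquoted on purpose: an empty $active disappears from the argument
    # list, so expr is handed an incomplete expression.
    expr $active \* $multiplier
}

# Same calculation, falling back to defaults when a value is empty.
guarded_score_for() {
    active="$1"
    multiplier="$2"
    expr "${active:-0}" \* "${multiplier:-1}"
}

score_for 3 100                                     # prints 300
score_for "" 100 || echo "expr failed with status $?"
guarded_score_for "" 100 || :   # prints 0; expr exits 1 when the result is 0
```

If something similar happens in the RA, the interesting question is
which input ends up empty after the heartbeat crash, not the expr call
itself.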

My questions are:
- Do we have to turn debug mode off to get rid of the max CPU time
  limit? Is that the right thing to do, or is the heartbeat process
  using too much CPU time?
- How do we fix the ping RA? Is the cluster somehow broken, given that
  the ping RA no longer works?

Best regards,

Daniel

_______________________________________________
Linux-HA mailing list
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems