Hi there,

we have a 12-node cluster for managing KVM-based virtual machines. The nodes run Fedora 12 with pacemaker (pacemaker-1.0.7-1.fc12.x86_64) and heartbeat (heartbeat-3.0.0-0.7.0daab7da36a8.hg.fc12.x86_64).

We had a crash of heartbeat with SIGXCPU:

Jan 2 01:21:11 node09 heartbeat: [31328]: WARN: Managed HBREAD process 25702 killed by signal 24 [SIGXCPU - CPU limit exceeded].
Jan 2 01:21:11 node09 heartbeat: [31328]: ERROR: Managed HBREAD process 25702 dumped core
Jan 2 01:21:11 node09 heartbeat: [31328]: ERROR: HBREAD process died. Beginning communications restart process for comm channel 0.
Jan 2 01:21:11 node09 heartbeat: [31328]: WARN: Managed HBWRITE process 25701 killed by signal 9 [SIGKILL - Kill, unblockable].
Jan 2 01:21:11 node09 heartbeat: [31328]: ERROR: Both comm processes for channel 0 have died. Restarting.
Jan 2 01:21:11 node09 heartbeat: [31328]: info: glib: UDP multicast heartbeat started for group 239.0.0.4 port 694 interface br_vlan1040 (ttl=1 loop=0)
Jan 2 01:21:11 node09 heartbeat: [31328]: info: Communications restart succeeded.
Jan 2 01:21:12 node09 heartbeat: [22135]: info: Stack hogger failed 0xffffffff
Jan 2 01:21:12 node09 heartbeat: [22136]: info: Stack hogger failed 0xffffffff

We figured out that when debug mode is turned on, heartbeat sets a max CPU time limit of 4143 (you can see it in /proc/<heartbeat-pid>/limits). With debug mode turned off, that limit is not set.

Directly after the heartbeat crash the pacemaker ping RA stops working; it only produces syntax errors:

Jan 2 01:21:24 node09 lrmd: [31341]: info: RA output: (pingd_stornet:8:monitor:stderr) expr: syntax error
Jan 2 01:21:24 node09 attrd_updater: [22148]: info: Invoked: attrd_updater -n pingd_stornet -v -d 5s
Jan 2 01:21:24 node09 attrd_updater: [22148]: info: attrd_lazy_update: Connecting to cluster... 5 retries remaining
Jan 2 01:21:38 node09 lrmd: [31341]: info: RA output: (pingd_stornet:8:monitor:stderr) expr: syntax error
Jan 2 01:21:38 node09 attrd_updater: [22172]: info: Invoked: attrd_updater -n pingd_stornet -v -d 5s
Jan 2 01:21:38 node09 attrd_updater: [22172]: info: attrd_lazy_update: Connecting to cluster... 5 retries remaining
Jan 2 01:21:52 node09 lrmd: [31341]: info: RA output: (pingd_stornet:8:monitor:stderr) expr: syntax error
Jan 2 01:21:52 node09 attrd_updater: [22191]: info: Invoked: attrd_updater -n pingd_stornet -v -d 5s

The ping RA no longer works on any machine that had the SIGXCPU crash.

My questions are:

- Do we have to turn debug mode off to get rid of the max CPU time limit? Is that the right thing to do, or is the heartbeat process using too much CPU time?
- How do we fix the ping RA? Is my cluster somehow screwed up, given that the ping RA is not working any more?

Bests,
Daniel

_______________________________________________
Linux-HA mailing list
[email protected]
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems
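
PS: For anyone who wants to repeat the limit check described above on their own nodes, here is a minimal sketch in Python. It assumes the usual Linux layout of /proc/<pid>/limits, where the "Max cpu time" row is followed by the soft limit, the hard limit and the units. The function names are made up for illustration and are not part of heartbeat or pacemaker.

```python
import sys


def parse_cpu_time_limit(limits_text):
    """Return (soft, hard) from the 'Max cpu time' row, or None if absent.

    Values are returned as strings because the kernel prints either a
    number or the word 'unlimited'.
    """
    label = "Max cpu time"
    for line in limits_text.splitlines():
        if line.startswith(label):
            # Assumed row layout: name, soft limit, hard limit, units.
            fields = line[len(label):].split()
            if len(fields) >= 2:
                return fields[0], fields[1]
    return None


def read_cpu_time_limit(pid):
    """Read /proc/<pid>/limits for the given PID and parse the CPU row."""
    with open(f"/proc/{pid}/limits") as fh:
        return parse_cpu_time_limit(fh.read())


if __name__ == "__main__":
    # Usage: python cpulimit.py <heartbeat-pid>
    if len(sys.argv) > 1:
        print(read_cpu_time_limit(sys.argv[1]))
```

Running it once with debug mode on and once with debug mode off, against the heartbeat master PID, should show whether the limit really follows the debug setting.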
