Hi,

On Wed, May 26, 2010 at 11:05:05AM -0500, Sam Reidland wrote:
> I have been working on a simple 2 node 2 resource cluster using
> Pacemaker 1.0.7 and heartbeat 3.0.2. The two resources are IPaddr and
> our application. When our application was started, the box would reboot
> (actually a clean restart). After a lot of searching I found that if I
> didn't initialize net-SNMP, everything started perfectly. The build of
> net-SNMP we use spits 2 or 3 lines to stderr when it starts and I
> noticed that the reboot occurred after the first line to stderr was
> printed and no other output was seen after that. My OCF script started
> our app with the following command '/BACKHAUL/bhApplication >/dev/null
> &'. I changed the command to '/BACKHAUL/bhApplication &>/dev/null &' and
> everything works as it should. So the question is, why does the HA
> software cause the box to reboot when something is sent to stderr?
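
For reference, the difference between the two redirections quoted above can be reproduced outside the cluster. This is only a minimal sketch: noisy_app is a hypothetical stand-in for the application (one line to stdout, one to stderr), not the real /BACKHAUL/bhApplication. Note that '&>' is a bash extension; a strict POSIX sh parses 'cmd &>/dev/null' as 'cmd &' followed by a separate '>/dev/null', so the portable spelling is '>/dev/null 2>&1'.

```shell
#!/bin/sh
# Hypothetical stand-in for the application: writes one line to stdout
# and one line to stderr, roughly like the net-SNMP startup messages.
noisy_app() {
    echo "normal output"
    echo "startup warning" >&2
}

# Original form: only stdout is discarded. stderr is still inherited from
# the caller; here we capture it to show that it gets through.
stdout_only=$( { noisy_app >/dev/null; } 2>&1 )

# Portable form: send stdout to /dev/null, then point stderr at the same
# place. Same effect as bash's '&>/dev/null', but valid in any POSIX sh.
both=$( { noisy_app >/dev/null 2>&1; } 2>&1 )

echo "stdout only discarded: '$stdout_only'"
echo "both discarded:        '$both'"
```

With the first form the stderr line still reaches whoever started the script; with the second form nothing does.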
Normally it shouldn't. Actually, whatever is caught on stderr
gets logged by lrmd.

Your clock is obviously not set. You should use ntp to sync
clocks on all nodes.

There's something wrong with your installation, i.e. some
directories are missing:

> Jan 1 00:14:23 bh130 daemon.err crmd: [1148]: ERROR: crm_log_init:
> Cannot change active directory to
> /usr/var/lib/heartbeat/cores/hacluster: No such file or directory (2)

Where did you get the packages?

You should start ha_logd. Or otherwise fix the logging setup.

> Jan 1 00:14:07 bh130 daemon.warn ccm: [1143]: WARN: Initializing
> connection to logging daemon failed. Logging daemon may not be running

> I'm not even sure what part caused the box to reboot.

You probably have "crm on" in ha.cf. If one of the subsystems
leaves, it's considered as a reason to reboot. You can use "crm
respawn" to prevent reboots.

> I have included the log from a session in which the box rebooted.

Better to attach logs instead of pasting.

> Jan 1 00:15:39 bh130 daemon.info lrmd: [1145]: info: rsc:bhApp:3: start
> Jan 1 00:15:39 bh130 daemon.info pengine: [1150]: info:
> process_pe_message: Transition 1: PEngine Input stored in:
> /usr/var/lib/pengine/pe-input-35.bz2
> Jan 1 00:15:39 bh130 daemon.info lrmd: [1145]: info: RA output:
> (bhApp:start:stderr) sh: you need to specify whom to kill
> Jan 1 00:15:39 bh130 daemon.info crmd: [1148]: info: process_lrm_event:
> LRM operation bhApp_start_0 (call=3, rc=0, cib-update=27, confirmed=true) ok
> Jan 1 00:15:39 bh130 daemon.info crmd: [1148]: info: match_graph_event:
> Action bhApp_start_0 (4) confirmed on bh130 (rc=0)
> Jan 1 00:15:39 bh130 daemon.info crmd: [1148]: info: te_rsc_command:
> Initiating action 5: monitor bhApp_monitor_240000 on bh130 (local)
> Jan 1 00:15:39 bh130 daemon.info crmd: [1148]: info: do_lrm_rsc_op:
> Performing key=5:1:0:fb4c1819-0064-4651-a330-a2071dc4e495
> op=bhApp_monitor_240000 )
> Jan 1 00:15:40 bh130 daemon.info crmd: [1148]: info: process_lrm_event:
> LRM operation bhApp_monitor_240000 (call=4, rc=0, cib-update=28,
> confirmed=false) ok
> Jan 1 00:15:40 bh130 daemon.info crmd: [1148]: info: match_graph_event:
> Action bhApp_monitor_240000 (5) confirmed on bh130 (rc=0)
> Jan 1 00:15:40 bh130 daemon.info crmd: [1148]: info: run_graph:
> ====================================================
> Jan 1 00:15:40 bh130 daemon.notice crmd: [1148]: notice: run_graph:
> Transition 1 (Complete=2, Pending=0, Fired=0, Skipped=0, Incomplete=0,
> Source=/usr/var/lib/pengine/pe-input-35.bz2): Complete
> Jan 1 00:15:40 bh130 daemon.info crmd: [1148]: info: te_graph_trigger:
> Transition 1 is now complete
> Jan 1 00:15:40 bh130 daemon.info crmd: [1148]: info: notify_crmd:
> Transition 1 status: done - <null>
> Jan 1 00:15:40 bh130 daemon.info crmd: [1148]: info:
> do_state_transition: State transition S_TRANSITION_ENGINE -> S_IDLE [
> input=I_TE_SUCCESS cause=C_FSA_INTERNAL origin=notify_crmd ]
> Jan 1 00:15:40 bh130 daemon.info crmd: [1148]: info:
> do_state_transition: Starting PEngine Recheck Timer
> Jan 1 00:15:45 bh130 daemon.crit crmd: [1148]: CRIT:
> lrm_connection_destroy: LRM Connection failed

lrmd crashed. Did you find any coredumps? If so, please provide
backtrace. If not, then enable coredumps, reproduce, and file a
bugzilla with hb_report.

And which cluster-glue version?

Thanks,

Dejan

_______________________________________________
Linux-HA mailing list
[email protected]
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems
