On Mon, Feb 02, 2009 at 03:26:20PM -0800, Gruher, Joseph R wrote: > Hello- > > We are developing our own STONITH plugin for our blade server > and having an issue we are hoping this list can help us with. > Our STONITH plugin script works well some of the time (the bad > node is reset and the resources fail over) but does not work > some of the time (resources never fail over). We have been > looking in the messages log (see examples below) and it appears > in the non-working case that the reset call to our plugin never > occurs, even though the getconfignames, status and gethosts > calls that normally lead up to it are made. Are there any > common problem that could cause this behavior? Any suggestions > how we can continue to debug this problem? Other logs we > should be looking in? Would it be useful to send our plugin or > any other files from the system? > > Thanks very much for any and all input. We are testing with SLES10.2 x64.
The "works" log looks fine. In the "not works" log, there's no request for stonith from tengine (look for tengine.*reboot). Thanks, Dejan > -Joe > > > When it works: > > Jan 28 11:33:09 25node2 crmd: [6325]: info: do_lrm_rsc_op: Performing > op=STONITH_v1.1_start_0 key=16:47:676b4bb5-523b-49c5-a5f3-a228f2af8149) > Jan 28 11:33:09 25node2 lrmd: [6322]: info: rsc:STONITH_v1.1: start > Jan 28 11:33:09 25node2 lrmd: [14675]: info: Try to start STONITH resource > <rsc_id=STONITH_v1.1> : Device=external/MFSYS_STONITH_PLUGIN_v1.1 > Jan 28 11:33:09 25node2 MFSYS_STONITH_PLUGIN_v1.1[14676]: getconfignames > (node2slot=; slot=) > Jan 28 11:33:09 25node2 MFSYS_STONITH_PLUGIN_v1.1[14676]: exiting script with > an rc=0 > Jan 28 11:33:09 25node2 MFSYS_STONITH_PLUGIN_v1.1[14687]: status > (node2slot=25node1=1,25node2=2,25node3=3; slot=) > Jan 28 11:33:09 25node2 ccm: [6320]: debug: quorum plugin: majority > Jan 28 11:33:09 25node2 ccm: [6320]: debug: cluster:linux-ha, member_count=2, > member_quorum_votes=200 > Jan 28 11:33:09 25node2 cib: [6321]: info: mem_handle_event: Got an event > OC_EV_MS_INVALID from ccm > Jan 28 11:33:09 25node2 ccm: [6320]: debug: total_node_count=3, > total_quorum_votes=300 > Jan 28 11:33:09 25node2 cib: [6321]: info: mem_handle_event: no mbr_track info > Jan 28 11:33:09 25node2 ccm: [6320]: debug: quorum plugin: majority > Jan 28 11:33:09 25node2 cib: [6321]: info: mem_handle_event: Got an event > OC_EV_MS_NEW_MEMBERSHIP from ccm > Jan 28 11:33:09 25node2 ccm: [6320]: debug: cluster:linux-ha, member_count=2, > member_quorum_votes=200 > Jan 28 11:33:09 25node2 crmd: [6325]: info: mem_handle_event: Got an event > OC_EV_MS_INVALID from ccm > Jan 28 11:33:09 25node2 cib: [6321]: info: mem_handle_event: instance=16, > nodes=2, new=0, lost=1, n_idx=0, new_idx=2, old_idx=5 > Jan 28 11:33:09 25node2 ccm: [6320]: debug: total_node_count=3, > total_quorum_votes=300 > Jan 28 11:33:09 25node2 crmd: [6325]: info: mem_handle_event: no mbr_track > info > Jan 28 11:33:09 25node2 cib: [6321]: info: cib_ccm_msg_callback: LOST: 25node1 > Jan 28 11:33:09 25node2 crmd: [6325]: info: mem_handle_event: Got an event > OC_EV_MS_NEW_MEMBERSHIP from ccm > Jan 28 11:33:09 25node2 cib: [6321]: info: cib_ccm_msg_callback: PEER: 25node2 > Jan 28 11:33:09 25node2 crmd: [6325]: info: mem_handle_event: instance=16, > nodes=2, new=0, lost=1, n_idx=0, new_idx=2, old_idx=5 > Jan 28 11:33:09 25node2 cib: [6321]: info: cib_ccm_msg_callback: PEER: 25node3 > Jan 28 11:33:09 25node2 crmd: [6325]: info: crmd_ccm_msg_callback: Quorum > (re)attained after event=NEW MEMBERSHIP (id=16) > Jan 28 11:33:09 25node2 crmd: [6325]: info: ccm_event_detail: NEW MEMBERSHIP: > trans=16, nodes=2, new=0, lost=1 n_idx=0, new_idx=2, old_idx=5 > Jan 28 11:33:09 25node2 crmd: [6325]: info: ccm_event_detail: CURRENT: > 25node2 [nodeid=1, born=2] > Jan 28 11:33:09 25node2 crmd: [6325]: info: ccm_event_detail: CURRENT: > 25node3 [nodeid=2, born=3] > Jan 28 11:33:09 25node2 crmd: [6325]: info: ccm_event_detail: LOST: > 25node1 [nodeid=0, born=15] > Jan 28 11:33:09 25node2 mgmtd: [6326]: ERROR: unpack_rsc_op: Remapping > WebServer_monitor_0 (rc=1) on 25node2 to an ERROR > Jan 28 11:33:09 25node2 mgmtd: [6326]: ERROR: unpack_rsc_op: Remapping > WebServer_monitor_0 (rc=1) on 25node3 to an ERROR > Jan 28 11:33:09 25node2 mgmtd: [6326]: ERROR: unpack_rsc_op: Remapping > WebServer_monitor_0 (rc=1) on 25node2 to an ERROR > Jan 28 11:33:09 25node2 mgmtd: [6326]: ERROR: unpack_rsc_op: Remapping > WebServer_monitor_0 (rc=1) on 25node3 to an ERROR > Jan 28 11:33:10 25node2 haclient: on_event: from message queue: > evt:cib_changed > Jan 28 11:33:11 25node2 MFSYS_STONITH_PLUGIN_v1.1[14687]: exiting script with > an rc=0 > Jan 28 11:33:12 25node2 MFSYS_STONITH_PLUGIN_v1.1[14703]: gethosts > (node2slot=25node1=1,25node2=2,25node3=3; slot=) > Jan 28 11:33:12 25node2 MFSYS_STONITH_PLUGIN_v1.1[14703]: exiting script with > an rc=0 > Jan 28 11:33:12 25node2 crmd: [6325]: info: process_lrm_event: LRM operation > STONITH_v1.1_start_0 (call=24, rc=0) complete > Jan 28 11:33:12 25node2 tengine: [6860]: info: match_graph_event: Action > STONITH_v1.1_start_0 (16) confirmed on 25node2 (rc=0) > Jan 28 11:33:12 25node2 tengine: [6860]: info: te_pseudo_action: Pseudo > action 17 fired and confirmed > Jan 28 11:33:12 25node2 tengine: [6860]: info: te_fence_node: Executing > reboot fencing operation (18) on 25node1 (timeout=50000) > Jan 28 11:33:12 25node2 haclient: on_event:evt:cib_changed > Jan 28 11:33:12 25node2 stonithd: [6323]: info: client tengine [pid: 6860] > want a STONITH operation RESET to node 25node1. > Jan 28 11:33:12 25node2 stonithd: [6323]: info: > stonith_operate_locally::2368: sending fencing op (RESET) for 25node1 to > device external (rsc_id=STONITH_v1.1, pid=14713) > Jan 28 11:33:12 25node2 MFSYS_STONITH_PLUGIN_v1.1[14714]: reset 25node1 > (node2slot=25node1=1,25node2=2,25node3=3; slot=1) > Jan 28 11:33:12 25node2 MFSYS_STONITH_PLUGIN_v1.1[14714]: retry=0 > bladeState=-1 powerstate=-32 > Jan 28 11:33:12 25node2 MFSYS_STONITH_PLUGIN_v1.1[14714]: exiting script with > an rc=0 > > > When it fails: > > Jan 30 15:54:27 vs-cb03-5cl crmd: [5116]: info: do_lrm_rsc_op: Performing > op=CBSTONITH_monitor_0 key=4:8:469db639-f549-4582-9735-2b5e89d147c2) > Jan 30 15:54:27 vs-cb03-5cl lrmd: [5113]: info: rsc:CBSTONITH: monitor > Jan 30 15:54:27 vs-cb03-5cl crmd: [5116]: info: process_lrm_event: LRM > operation CBSTONITH_monitor_0 (call=20, rc=7) complete > Jan 30 15:54:27 vs-cb03-5cl cib: [30766]: info: retrieveCib: Reading cluster > configuration from: /var/lib/heartbeat/crm/cib.xml (digest: > /var/lib/heartbeat/crm/cib.xml.sig) > Jan 30 15:54:27 vs-cb03-5cl cib: [30766]: info: retrieveCib: Reading cluster > configuration from: /var/lib/heartbeat/crm/cib.xml (digest: > /var/lib/heartbeat/crm/cib.xml.sig) > Jan 30 15:54:27 vs-cb03-5cl cib: [30766]: info: retrieveCib: Reading cluster > configuration from: /var/lib/heartbeat/crm/cib.xml.last (digest: > /var/lib/heartbeat/crm/cib.xml.sig.last) > Jan 30 15:54:27 vs-cb03-5cl cib: [30766]: info: write_cib_contents: Wrote > version 0.1175.3 of the CIB to disk (digest: 73c66f86b10cd0b1bc58a3beab870faa) > Jan 30 15:54:27 vs-cb03-5cl cib: [30766]: info: retrieveCib: Reading cluster > configuration from: /var/lib/heartbeat/crm/cib.xml (digest: > /var/lib/heartbeat/crm/cib.xml.sig) > Jan 30 15:54:27 vs-cb03-5cl cib: [30766]: info: retrieveCib: Reading cluster > configuration from: /var/lib/heartbeat/crm/cib.xml.last (digest: > /var/lib/heartbeat/crm/cib.xml.sig.last) > Jan 30 15:54:28 vs-cb03-5cl crmd: [5116]: info: do_lrm_rsc_op: Performing > op=CBSTONITH_start_0 key=19:8:469db639-f549-4582-9735-2b5e89d147c2) > Jan 30 15:54:28 vs-cb03-5cl lrmd: [5113]: info: rsc:CBSTONITH: start > Jan 30 15:54:28 vs-cb03-5cl lrmd: [30768]: info: Try to start STONITH > resource <rsc_id=CBSTONITH> : Device=external/MFSYS_STONITH_PLUGIN_v1.1 > Jan 30 15:54:28 vs-cb03-5cl cib: [30767]: info: retrieveCib: Reading cluster > configuration from: /var/lib/heartbeat/crm/cib.xml (digest: > /var/lib/heartbeat/crm/cib.xml.sig) > Jan 30 15:54:28 vs-cb03-5cl cib: [30767]: info: retrieveCib: Reading cluster > configuration from: /var/lib/heartbeat/crm/cib.xml (digest: > /var/lib/heartbeat/crm/cib.xml.sig) > Jan 30 15:54:28 vs-cb03-5cl cib: [30767]: info: retrieveCib: Reading cluster > configuration from: /var/lib/heartbeat/crm/cib.xml.last (digest: > /var/lib/heartbeat/crm/cib.xml.sig.last) > Jan 30 15:54:28 vs-cb03-5cl cib: [30767]: info: write_cib_contents: Wrote > version 0.1175.5 of the CIB to disk (digest: f2c35efbf0c7066043202d754cf607bb) > Jan 30 15:54:28 vs-cb03-5cl cib: [30767]: info: retrieveCib: Reading cluster > configuration from: /var/lib/heartbeat/crm/cib.xml (digest: > /var/lib/heartbeat/crm/cib.xml.sig) > Jan 30 15:54:28 vs-cb03-5cl cib: [30767]: info: retrieveCib: Reading cluster > configuration from: /var/lib/heartbeat/crm/cib.xml.last (digest: > /var/lib/heartbeat/crm/cib.xml.sig.last) > Jan 30 15:54:29 vs-cb03-5cl MFSYS_STONITH_PLUGIN_v1.1[30769]: getconfignames > (node2slot=; slot=) > Jan 30 15:54:29 vs-cb03-5cl MFSYS_STONITH_PLUGIN_v1.1[30769]: exiting script > with an rc=0 > Jan 30 15:54:29 vs-cb03-5cl MFSYS_STONITH_PLUGIN_v1.1[30780]: status > (node2slot=vs-cb03-2cl=2,vs-cb03-5cl=5,vs-cb03-6cl=6; slot=) > Jan 30 15:54:29 vs-cb03-5cl cib: [30795]: info: retrieveCib: Reading cluster > configuration from: /var/lib/heartbeat/crm/cib.xml (digest: > /var/lib/heartbeat/crm/cib.xml.sig) > Jan 30 15:54:29 vs-cb03-5cl cib: [30795]: info: retrieveCib: Reading cluster > configuration from: /var/lib/heartbeat/crm/cib.xml (digest: > /var/lib/heartbeat/crm/cib.xml.sig) > Jan 30 15:54:29 vs-cb03-5cl cib: [30795]: info: retrieveCib: Reading cluster > configuration from: /var/lib/heartbeat/crm/cib.xml.last (digest: > /var/lib/heartbeat/crm/cib.xml.sig.last) > Jan 30 15:54:29 vs-cb03-5cl cib: [30795]: info: write_cib_contents: Wrote > version 0.1175.7 of the CIB to disk (digest: 551519e2dfb361ccf0f4303167997435) > Jan 30 15:54:29 vs-cb03-5cl cib: [30795]: info: retrieveCib: Reading cluster > configuration from: /var/lib/heartbeat/crm/cib.xml (digest: > /var/lib/heartbeat/crm/cib.xml.sig) > Jan 30 15:54:29 vs-cb03-5cl cib: [30795]: info: retrieveCib: Reading cluster > configuration from: /var/lib/heartbeat/crm/cib.xml.last (digest: > /var/lib/heartbeat/crm/cib.xml.sig.last) > Jan 30 15:54:31 vs-cb03-5cl MFSYS_STONITH_PLUGIN_v1.1[30780]: exiting script > with an rc=0 > Jan 30 15:54:31 vs-cb03-5cl MFSYS_STONITH_PLUGIN_v1.1[30797]: gethosts > (node2slot=vs-cb03-2cl=2,vs-cb03-5cl=5,vs-cb03-6cl=6; slot=) > Jan 30 15:54:31 vs-cb03-5cl MFSYS_STONITH_PLUGIN_v1.1[30797]: JOE02: hostlist > = vs-cb03-2cl vs-cb03-5cl vs-cb03-6cl > Jan 30 15:54:31 vs-cb03-5cl MFSYS_STONITH_PLUGIN_v1.1[30797]: exiting script > with an rc=0 > Jan 30 15:54:31 vs-cb03-5cl crmd: [5116]: info: process_lrm_event: LRM > operation CBSTONITH_start_0 (call=21, rc=0) complete > _______________________________________________ > Linux-HA mailing list > [email protected] > http://lists.linux-ha.org/mailman/listinfo/linux-ha > See also: http://linux-ha.org/ReportingProblems _______________________________________________ Linux-HA mailing list [email protected] http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
