I've been seeing an occasional problem where resources are not restarted when the node running them is powered off. In the specific case in question, I have a 2-node cluster (with STONITH); the resources are running on one node (node1) and the DC is on the other node (node0). After power cycling node1, I see the log below in the ha log file and then nothing else until about 3 minutes later, when the other node comes back.
Any pointers to what to look at would be appreciated. It seems to me that it is related to the order in which the various events are processed, but I can't quite follow the code.

Thanks,
Simon

Nov 30 13:40:03 node0 heartbeat: [19665]: WARN: node node1: is dead
Nov 30 13:40:03 node0 heartbeat: [19665]: info: Link node1:priv0 dead.
Nov 30 13:40:03 node0 heartbeat: [19665]: info: Link node1:biz0 dead.
Nov 30 13:40:03 node0 crmd: [24367]: notice: crmd_ha_status_callback: Status update: Node node1 now has status [dead]
Nov 30 13:40:03 node0 cib: [24362]: info: cib_diff_notify: Local-only Change (client:24367, call: 66): 0.2.51 (ok)
Nov 30 13:40:03 node0 ccm: [24360]: debug: quorum plugin: majority
Nov 30 13:40:03 node0 ccm: [24360]: debug: cluster:linux-ha, member_count=1, member_quorum_votes=100
Nov 30 13:40:03 node0 ccm: [24360]: debug: total_node_count=2, total_quorum_votes=200
Nov 30 13:40:03 node0 ccm: [24360]: debug: quorum plugin: twonodes
Nov 30 13:40:03 node0 ccm: [24360]: debug: cluster:linux-ha, member_count=1, member_quorum_votes=100
Nov 30 13:40:03 node0 ccm: [24360]: debug: total_node_count=2, total_quorum_votes=200
Nov 30 13:40:03 node0 ccm: [24360]: info: Break tie for 2 nodes cluster
Nov 30 13:40:03 node0 tengine: [25399]: info: te_update_diff: Processing diff (cib_update): 0.2.51 -> 0.2.51
Nov 30 13:40:03 node0 tengine: [25399]: WARN: match_down_event: No match for shutdown action on 44454c4c-3700-104e-8035-b5c04f504431
Nov 30 13:40:03 node0 crmd: [24367]: info: mem_handle_event: Got an event OC_EV_MS_INVALID from ccm
Nov 30 13:40:03 node0 tengine: [25399]: info: extract_event: Stonith/shutdown of 44454c4c-3700-104e-8035-b5c04f504431 not matched
Nov 30 13:40:03 node0 crmd: [24367]: info: mem_handle_event: no mbr_track info
Nov 30 13:40:03 node0 tengine: [25399]: info: update_abort_priority: Abort priority upgraded to 1000000
Nov 30 13:40:03 node0 crmd: [24367]: info: do_state_transition: node0: State transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_IPC_MESSAGE origin=route_message ]
Nov 30 13:40:03 node0 tengine: [25399]: info: te_update_diff: Aborting on transient_attributes deletions
Nov 30 13:40:03 node0 crmd: [24367]: info: do_state_transition: All 2 cluster nodes are eligible to run resources.
Nov 30 13:40:03 node0 cib: [24362]: info: mem_handle_event: Got an event OC_EV_MS_INVALID from ccm
Nov 30 13:40:03 node0 cib: [24362]: info: mem_handle_event: no mbr_track info
Nov 30 13:40:03 node0 cib: [24362]: info: mem_handle_event: Got an event OC_EV_MS_NEW_MEMBERSHIP from ccm
Nov 30 13:40:03 node0 cib: [24362]: info: mem_handle_event: instance=4, nodes=1, new=0, lost=1, n_idx=0, new_idx=1, old_idx=3
Nov 30 13:40:03 node0 cib: [24362]: info: cib_ccm_msg_callback: LOST: node1
Nov 30 13:40:03 node0 cib: [24362]: info: cib_ccm_msg_callback: PEER: node0
Nov 30 13:40:03 node0 crmd: [24367]: info: do_pe_invoke_callback: Waiting for another CCM event before proceeding: CIB=4 > CRM=3
Nov 30 13:40:03 node0 crmd: [24367]: info: mem_handle_event: Got an event OC_EV_MS_NEW_MEMBERSHIP from ccm
Nov 30 13:40:03 node0 crmd: [24367]: info: mem_handle_event: instance=4, nodes=1, new=0, lost=1, n_idx=0, new_idx=1, old_idx=3
Nov 30 13:40:03 node0 crmd: [24367]: info: crmd_ccm_msg_callback: Quorum (re)attained after event=NEW MEMBERSHIP (id=4)
Nov 30 13:40:03 node0 crmd: [24367]: info: ccm_event_detail: NEW MEMBERSHIP: trans=4, nodes=1, new=0, lost=1 n_idx=0, new_idx=1, old_idx=3
Nov 30 13:40:03 node0 crmd: [24367]: info: ccm_event_detail: CURRENT: node0 [nodeid=0, born=4]
Nov 30 13:40:03 node0 crmd: [24367]: info: ccm_event_detail: LOST: node1 [nodeid=1, born=3]
Nov 30 13:40:04 node0 cib: [24362]: info: cib_diff_notify: Local-only Change (client:24367, call: 69): 0.2.51 (ok)
Nov 30 13:40:04 node0 tengine: [25399]: info: te_update_diff: Processing diff (cib_update): 0.2.51 -> 0.2.51
Nov 30 13:40:04 node0 cib: [907]: info: write_cib_contents: Wrote version 0.2.51 of the CIB to disk (digest: a0c6f0b8d31bfac96182e1bc2d02cad0)
Nov 30 13:40:04 node0 cib: [908]: info: write_cib_contents: Wrote version 0.2.51 of the CIB to disk (digest: 66b77c6223da9fcedcadcbac04fd0d67)
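
Since the question is about the order in which events are processed, one way to read a log like this is to split it by daemon and look at each component's timeline separately. The sketch below is only an illustration. It assumes the syslog-style layout seen in the log above (timestamp, host, daemon, [pid], level, message), and the names `Entry`, `parse_line` and `events_for` are hypothetical helpers, not part of heartbeat. The sample lines are copied from the log.

```python
import re
from collections import namedtuple

# Hypothetical record type for one log entry (not a heartbeat structure).
Entry = namedtuple("Entry", "time host daemon pid level message")

# Assumed layout: "Nov 30 13:40:03 node0 crmd: [24367]: info: message..."
LINE_RE = re.compile(
    r"^(?P<time>\w{3}\s+\d+ \d\d:\d\d:\d\d) "
    r"(?P<host>\S+) (?P<daemon>[\w-]+): \[(?P<pid>\d+)\]: "
    r"(?P<level>\w+): (?P<message>.*)$"
)


def parse_line(line):
    """Return an Entry for a log line, or None if it does not match."""
    match = LINE_RE.match(line.strip())
    if match is None:
        return None
    return Entry(**match.groupdict())


def events_for(lines, daemon):
    """Return the entries logged by one daemon, in their original order."""
    entries = (parse_line(line) for line in lines)
    return [e for e in entries if e is not None and e.daemon == daemon]


SAMPLE = """\
Nov 30 13:40:03 node0 heartbeat: [19665]: WARN: node node1: is dead
Nov 30 13:40:03 node0 cib: [24362]: info: cib_ccm_msg_callback: LOST: node1
Nov 30 13:40:03 node0 crmd: [24367]: info: do_pe_invoke_callback: Waiting for another CCM event before proceeding: CIB=4 > CRM=3
Nov 30 13:40:03 node0 crmd: [24367]: info: mem_handle_event: Got an event OC_EV_MS_NEW_MEMBERSHIP from ccm
"""

if __name__ == "__main__":
    # Print only the crmd timeline from the sample.
    for entry in events_for(SAMPLE.splitlines(), "crmd"):
        print(entry.time, entry.level, entry.message)
```

Run against the full log, the crmd timeline makes it easier to see that the "Waiting for another CCM event" message is logged before crmd itself handles the new membership event.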
