Oops - forgot to mention HB version - 2.1.0; really just looking for pointers to help me debug here...
Simon

> -----Original Message-----
> From: [EMAIL PROTECTED] [mailto:linux-ha-[EMAIL PROTECTED]] On Behalf Of Graham, Simon
> Sent: Monday, December 03, 2007 2:33 PM
> To: General Linux-HA mailing list
> Subject: [Linux-HA] HB sometimes forgets to migrate my resources...
>
> I've been seeing an occasional problem where resources are not restarted
> when the node running them is powered off. In the specific case in
> question, I have a 2-node cluster (with STONITH), the resources are
> running on one node (node1), and the DC is on the other node (node0) --
> after power-cycling node1, I see the attached in the ha log file and
> then nothing else until about 3 minutes later, when the other node comes
> back.
>
> Any pointers to what to look at would be appreciated... it seems to me
> that it is related to the order in which the various events are processed,
> but I can't quite follow the code...
>
> Thanks,
> Simon
>
> Nov 30 13:40:03 node0 heartbeat: [19665]: WARN: node node1: is dead
> Nov 30 13:40:03 node0 heartbeat: [19665]: info: Link node1:priv0 dead.
> Nov 30 13:40:03 node0 heartbeat: [19665]: info: Link node1:biz0 dead.
> Nov 30 13:40:03 node0 crmd: [24367]: notice: crmd_ha_status_callback: Status update: Node node1 now has status [dead]
> Nov 30 13:40:03 node0 cib: [24362]: info: cib_diff_notify: Local-only Change (client:24367, call: 66): 0.2.51 (ok)
> Nov 30 13:40:03 node0 ccm: [24360]: debug: quorum plugin: majority
> Nov 30 13:40:03 node0 ccm: [24360]: debug: cluster:linux-ha, member_count=1, member_quorum_votes=100
> Nov 30 13:40:03 node0 ccm: [24360]: debug: total_node_count=2, total_quorum_votes=200
> Nov 30 13:40:03 node0 ccm: [24360]: debug: quorum plugin: twonodes
> Nov 30 13:40:03 node0 ccm: [24360]: debug: cluster:linux-ha, member_count=1, member_quorum_votes=100
> Nov 30 13:40:03 node0 ccm: [24360]: debug: total_node_count=2, total_quorum_votes=200
> Nov 30 13:40:03 node0 ccm: [24360]: info: Break tie for 2 nodes cluster
> Nov 30 13:40:03 node0 tengine: [25399]: info: te_update_diff: Processing diff (cib_update): 0.2.51 -> 0.2.51
> Nov 30 13:40:03 node0 tengine: [25399]: WARN: match_down_event: No match for shutdown action on 44454c4c-3700-104e-8035-b5c04f504431
> Nov 30 13:40:03 node0 crmd: [24367]: info: mem_handle_event: Got an event OC_EV_MS_INVALID from ccm
> Nov 30 13:40:03 node0 tengine: [25399]: info: extract_event: Stonith/shutdown of 44454c4c-3700-104e-8035-b5c04f504431 not matched
> Nov 30 13:40:03 node0 crmd: [24367]: info: mem_handle_event: no mbr_track info
> Nov 30 13:40:03 node0 tengine: [25399]: info: update_abort_priority: Abort priority upgraded to 1000000
> Nov 30 13:40:03 node0 crmd: [24367]: info: do_state_transition: node0: State transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_IPC_MESSAGE origin=route_message ]
> Nov 30 13:40:03 node0 tengine: [25399]: info: te_update_diff: Aborting on transient_attributes deletions
> Nov 30 13:40:03 node0 crmd: [24367]: info: do_state_transition: All 2 cluster nodes are eligible to run resources.
> Nov 30 13:40:03 node0 cib: [24362]: info: mem_handle_event: Got an event OC_EV_MS_INVALID from ccm
> Nov 30 13:40:03 node0 cib: [24362]: info: mem_handle_event: no mbr_track info
> Nov 30 13:40:03 node0 cib: [24362]: info: mem_handle_event: Got an event OC_EV_MS_NEW_MEMBERSHIP from ccm
> Nov 30 13:40:03 node0 cib: [24362]: info: mem_handle_event: instance=4, nodes=1, new=0, lost=1, n_idx=0, new_idx=1, old_idx=3
> Nov 30 13:40:03 node0 cib: [24362]: info: cib_ccm_msg_callback: LOST: node1
> Nov 30 13:40:03 node0 cib: [24362]: info: cib_ccm_msg_callback: PEER: node0
> Nov 30 13:40:03 node0 crmd: [24367]: info: do_pe_invoke_callback: Waiting for another CCM event before proceeding: CIB=4 > CRM=3
> Nov 30 13:40:03 node0 crmd: [24367]: info: mem_handle_event: Got an event OC_EV_MS_NEW_MEMBERSHIP from ccm
> Nov 30 13:40:03 node0 crmd: [24367]: info: mem_handle_event: instance=4, nodes=1, new=0, lost=1, n_idx=0, new_idx=1, old_idx=3
> Nov 30 13:40:03 node0 crmd: [24367]: info: crmd_ccm_msg_callback: Quorum (re)attained after event=NEW MEMBERSHIP (id=4)
> Nov 30 13:40:03 node0 crmd: [24367]: info: ccm_event_detail: NEW MEMBERSHIP: trans=4, nodes=1, new=0, lost=1 n_idx=0, new_idx=1, old_idx=3
> Nov 30 13:40:03 node0 crmd: [24367]: info: ccm_event_detail: CURRENT: node0 [nodeid=0, born=4]
> Nov 30 13:40:03 node0 crmd: [24367]: info: ccm_event_detail: LOST: node1 [nodeid=1, born=3]
> Nov 30 13:40:04 node0 cib: [24362]: info: cib_diff_notify: Local-only Change (client:24367, call: 69): 0.2.51 (ok)
> Nov 30 13:40:04 node0 tengine: [25399]: info: te_update_diff: Processing diff (cib_update): 0.2.51 -> 0.2.51
> Nov 30 13:40:04 node0 cib: [907]: info: write_cib_contents: Wrote version 0.2.51 of the CIB to disk (digest: a0c6f0b8d31bfac96182e1bc2d02cad0)
> Nov 30 13:40:04 node0 cib: [908]: info: write_cib_contents: Wrote version 0.2.51 of the CIB to disk (digest: 66b77c6223da9fcedcadcbac04fd0d67)
> _______________________________________________
> Linux-HA mailing list
> [email protected]
> http://lists.linux-ha.org/mailman/listinfo/linux-ha
> See also: http://linux-ha.org/ReportingProblems
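(As a debugging aid for logs like the one above: the tengine, crmd, ccm, and cib messages arrive interleaved, which makes the event ordering hard to follow by eye. A small sketch of one way to pull each log line apart by daemon so the per-process sequences can be compared -- this is a hypothetical helper, not part of Heartbeat, and the regex simply assumes the syslog-style format shown in the log excerpt above.)

```python
import re

# ha-log lines in the excerpt look like:
#   Nov 30 13:40:03 node0 tengine: [25399]: WARN: match_down_event: ...
LOG_RE = re.compile(
    r"^(?P<ts>\w+ \d+ [\d:]+) (?P<host>\S+) "
    r"(?P<daemon>\w+): \[(?P<pid>\d+)\]: (?P<level>\w+): (?P<msg>.*)$"
)

def parse(lines):
    """Yield (timestamp, daemon, level, message) for each line that matches."""
    for line in lines:
        m = LOG_RE.match(line)
        if m:
            yield m.group("ts"), m.group("daemon"), m.group("level"), m.group("msg")

# Two lines copied from the log above, as sample input.
sample = [
    "Nov 30 13:40:03 node0 heartbeat: [19665]: WARN: node node1: is dead",
    "Nov 30 13:40:03 node0 tengine: [25399]: WARN: match_down_event: No match "
    "for shutdown action on 44454c4c-3700-104e-8035-b5c04f504431",
]

for ts, daemon, level, msg in parse(sample):
    print(f"{ts} {daemon:>9} {level:>5} {msg}")
```

Grouping the parsed tuples by daemon (or filtering on WARN-level messages such as the match_down_event one) makes it easier to see whether the tengine processed the node-death update before the CCM membership event arrived.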
