Hi,

On Mon, Dec 03, 2007 at 02:33:19PM -0500, Graham, Simon wrote:
> I've been seeing an occasional problem where resources are not restarted
> when the node running them is powered off; in the specific case in
> question, I have a 2-node cluster (with STONITH), the resources are

The only thing I can imagine is that the surviving node tried to
stonith the other but couldn't, in which case it will wait
forever.
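To illustrate the rule at work (a minimal sketch of the behaviour described above, not the actual crmd/tengine code; all names here are hypothetical): recovery of the dead node's resources is gated on a confirmed fencing result, so a STONITH operation that never succeeds blocks the transition indefinitely.

```python
# Hypothetical sketch: the surviving node may not restart the dead
# peer's resources until fencing is confirmed.

def try_fence(node, attempts):
    """Pretend STONITH device: fails for the first `attempts` tries."""
    try_fence.calls += 1
    return try_fence.calls > attempts

try_fence.calls = 0

def recover_resources(dead_node, max_polls):
    """Poll the fencing result; only restart resources once it succeeds.
    If fencing never confirms, we keep waiting -- the 'wait forever'
    case seen in the cluster."""
    for _ in range(max_polls):
        if try_fence(dead_node, attempts=2):
            return f"restart resources from {dead_node}"
    return "still waiting for STONITH confirmation"

print(recover_resources("node1", max_polls=5))
```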

> running on one node (node1) and the DC is on the other node (node0) --
> after power cycling node1, I see the attached in the ha log file and
> then nothing else until about 3 minutes later when the other node comes
> back.

Can you please give us the full logs? For the whole 3 minutes.

> Any pointers to what to look at would be appreciated... it seems to me
> that it is related to the order in which the various events are processed
> but I can't quite follow the code...

As usual, look for errors in the logs. Hmm, don't try to follow
the code; it could take a while...
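For example, something like the following narrows a log down to the lines worth reading first (the sample file and its path are illustrative only; substitute your real heartbeat log, wherever ha.cf sends it):

```shell
# Illustrative only: build a tiny sample from the lines quoted in this
# thread, then filter for the severities that usually matter.
cat > /tmp/ha-log.sample <<'EOF'
Nov 30 13:40:03 node0 heartbeat: [19665]: WARN: node node1: is dead
Nov 30 13:40:03 node0 heartbeat: [19665]: info: Link node1:priv0 dead.
Nov 30 13:40:03 node0 cib: [24362]: info: cib_ccm_msg_callback: LOST: node1
EOF

grep -E 'ERROR|WARN|CRIT' /tmp/ha-log.sample
```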

Thanks,

Dejan

> 
> Thanks,
> Simon
> 
> Nov 30 13:40:03 node0 heartbeat: [19665]: WARN: node node1: is dead
> Nov 30 13:40:03 node0 heartbeat: [19665]: info: Link node1:priv0 dead.
> Nov 30 13:40:03 node0 heartbeat: [19665]: info: Link node1:biz0 dead.
> Nov 30 13:40:03 node0 crmd: [24367]: notice: crmd_ha_status_callback:
> Status update: Node node1 now has status [dead]
> Nov 30 13:40:03 node0 cib: [24362]: info: cib_diff_notify: Local-only
> Change (client:24367, call: 66): 0.2.51 (ok)
> Nov 30 13:40:03 node0 ccm: [24360]: debug: quorum plugin: majority
> Nov 30 13:40:03 node0 ccm: [24360]: debug: cluster:linux-ha,
> member_count=1, member_quorum_votes=100
> Nov 30 13:40:03 node0 ccm: [24360]: debug: total_node_count=2,
> total_quorum_votes=200
> Nov 30 13:40:03 node0 ccm: [24360]: debug: quorum plugin: twonodes
> Nov 30 13:40:03 node0 ccm: [24360]: debug: cluster:linux-ha,
> member_count=1, member_quorum_votes=100
> Nov 30 13:40:03 node0 ccm: [24360]: debug: total_node_count=2,
> total_quorum_votes=200
> Nov 30 13:40:03 node0 ccm: [24360]: info: Break tie for 2 nodes cluster
> Nov 30 13:40:03 node0 tengine: [25399]: info: te_update_diff: Processing
> diff (cib_update): 0.2.51 -> 0.2.51
> Nov 30 13:40:03 node0 tengine: [25399]: WARN: match_down_event: No match
> for shutdown action on 44454c4c-3700-104e-8035-b5c04f504431
> Nov 30 13:40:03 node0 crmd: [24367]: info: mem_handle_event: Got an
> event OC_EV_MS_INVALID from ccm
> Nov 30 13:40:03 node0 tengine: [25399]: info: extract_event:
> Stonith/shutdown of 44454c4c-3700-104e-8035-b5c04f504431 not matched
> Nov 30 13:40:03 node0 crmd: [24367]: info: mem_handle_event: no
> mbr_track info
> Nov 30 13:40:03 node0 tengine: [25399]: info: update_abort_priority:
> Abort priority upgraded to 1000000
> Nov 30 13:40:03 node0 crmd: [24367]: info: do_state_transition: node0:
> State transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC
> cause=C_IPC_MESSAGE origin=route_message ]
> Nov 30 13:40:03 node0 tengine: [25399]: info: te_update_diff: Aborting
> on transient_attributes deletions
> Nov 30 13:40:03 node0 crmd: [24367]: info: do_state_transition: All 2
> cluster nodes are eligible to run resources.
> Nov 30 13:40:03 node0 cib: [24362]: info: mem_handle_event: Got an event
> OC_EV_MS_INVALID from ccm
> Nov 30 13:40:03 node0 cib: [24362]: info: mem_handle_event: no mbr_track
> info
> Nov 30 13:40:03 node0 cib: [24362]: info: mem_handle_event: Got an event
> OC_EV_MS_NEW_MEMBERSHIP from ccm
> Nov 30 13:40:03 node0 cib: [24362]: info: mem_handle_event: instance=4,
> nodes=1, new=0, lost=1, n_idx=0, new_idx=1, old_idx=3
> Nov 30 13:40:03 node0 cib: [24362]: info: cib_ccm_msg_callback: LOST:
> node1
> Nov 30 13:40:03 node0 cib: [24362]: info: cib_ccm_msg_callback: PEER:
> node0
> Nov 30 13:40:03 node0 crmd: [24367]: info: do_pe_invoke_callback:
> Waiting for another CCM event before proceeding: CIB=4 > CRM=3

I've never seen this message before.

> Nov 30 13:40:03 node0 crmd: [24367]: info: mem_handle_event: Got an
> event OC_EV_MS_NEW_MEMBERSHIP from ccm

Perhaps this is another event crmd was waiting for.
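Reading the earlier message literally, the cib has already seen membership instance 4 while crmd is still at 3, so crmd defers the PE run until its own CCM callback catches up. A hypothetical sketch of that comparison (the function and strings are mine, not the real crmd internals):

```python
# Hypothetical sketch: defer policy-engine invocation while the CIB's
# membership instance is ahead of the one crmd itself has processed.

def pe_invoke_decision(cib_instance, crm_instance):
    if cib_instance > crm_instance:
        return (f"Waiting for another CCM event before proceeding: "
                f"CIB={cib_instance} > CRM={crm_instance}")
    return "invoke policy engine"

print(pe_invoke_decision(4, 3))   # the situation in the log above
print(pe_invoke_decision(4, 4))   # after crmd's NEW MEMBERSHIP event arrives
```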

> Nov 30 13:40:03 node0 crmd: [24367]: info: mem_handle_event: instance=4,
> nodes=1, new=0, lost=1, n_idx=0, new_idx=1, old_idx=3
> Nov 30 13:40:03 node0 crmd: [24367]: info: crmd_ccm_msg_callback: Quorum
> (re)attained after event=NEW MEMBERSHIP (id=4)
> Nov 30 13:40:03 node0 crmd: [24367]: info: ccm_event_detail: NEW
> MEMBERSHIP: trans=4, nodes=1, new=0, lost=1 n_idx=0, new_idx=1,
> old_idx=3
> Nov 30 13:40:03 node0 crmd: [24367]: info: ccm_event_detail:  CURRENT:
> node0 [nodeid=0, born=4]
> Nov 30 13:40:03 node0 crmd: [24367]: info: ccm_event_detail:  LOST:
> node1 [nodeid=1, born=3]
> Nov 30 13:40:04 node0 cib: [24362]: info: cib_diff_notify: Local-only
> Change (client:24367, call: 69): 0.2.51 (ok)
> Nov 30 13:40:04 node0 tengine: [25399]: info: te_update_diff: Processing
> diff (cib_update): 0.2.51 -> 0.2.51
> Nov 30 13:40:04 node0 cib: [907]: info: write_cib_contents: Wrote
> version 0.2.51 of the CIB to disk (digest:
> a0c6f0b8d31bfac96182e1bc2d02cad0)
> Nov 30 13:40:04 node0 cib: [908]: info: write_cib_contents: Wrote
> version 0.2.51 of the CIB to disk (digest:
> 66b77c6223da9fcedcadcbac04fd0d67)
_______________________________________________
Linux-HA mailing list
[email protected]
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems
