Hi everyone.

We have a three-node cluster running Pacemaker 1.0.10 on RHEL 5.5. Two nodes (wapgw1-1 and wapgw1-2) are configured for DRBD and run virtual machines on top of it. The third node (wapgw1-log) is mostly a quorum server, i.e. it has neither libvirtd nor DRBD installed. Location constraints allow resources to run on the "real" nodes only.
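
For illustration, the constraints are of roughly this shape (a crm shell sketch from memory; the constraint IDs and the -inf score are illustrative, not a verbatim copy of our CIB):

# forbid a VM resource on the quorum node entirely (illustrative ID/score)
location loc-vm-mdirect1-1-not-on-log vm-mdirect1-1 -inf: wapgw1-log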

All three nodes are connected to the network over bonded links in active-backup mode.
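
On RHEL 5 the bonding is configured roughly like this (a sketch; the miimon value is an assumption, and IP details are omitted):

# /etc/modprobe.conf
alias bond0 bonding
# miimon=100 (link polling every 100 ms) is an assumed value
options bond0 mode=active-backup miimon=100

# /etc/sysconfig/network-scripts/ifcfg-eth0 (ifcfg-eth1 is analogous)
DEVICE=eth0
MASTER=bond0
SLAVE=yes
ONBOOT=yes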

STONITH had been configured but was unavailable at the time. That's bad, I know.

The problem came when one of the two interfaces on the quorum node (wapgw1-log) went down. It was not the first time, and previously this had never caused any harm.

Corosync lost connectivity, and the cluster fell apart into partitions.

Mar  1 11:15:58 wapgw1-log corosync[24536]: [TOTEM ] A processor failed, forming new configuration.
Mar  1 11:15:59 wapgw1-log corosync[24536]: [pcmk ] notice: pcmk_peer_update: Transitional membership event on ring 3500: memb=1, new=0, lost=2
Mar  1 11:15:59 wapgw1-log corosync[24536]: [pcmk ] info: pcmk_peer_update: memb: wapgw1-log 813454090
Mar  1 11:15:59 wapgw1-log corosync[24536]: [pcmk ] info: pcmk_peer_update: lost: wapgw1-1 1098666762
Mar  1 11:15:59 wapgw1-log corosync[24536]: [pcmk ] info: pcmk_peer_update: lost: wapgw1-2 1115443978
Mar  1 11:15:59 wapgw1-log corosync[24536]: [pcmk ] notice: pcmk_peer_update: Stable membership event on ring 3500: memb=1, new=0, lost=0
Mar  1 11:15:59 wapgw1-log corosync[24536]: [pcmk ] info: pcmk_peer_update: MEMB: wapgw1-log 813454090
Mar  1 11:15:59 wapgw1-log corosync[24536]: [pcmk ] info: ais_mark_unseen_peer_dead: Node wapgw1-2 was not seen in the previous transition
Mar  1 11:15:59 wapgw1-log corosync[24536]: [pcmk ] info: update_member: Node 1115443978/wapgw1-2 is now: lost
Mar  1 11:15:59 wapgw1-log corosync[24536]: [pcmk ] info: ais_mark_unseen_peer_dead: Node wapgw1-1 was not seen in the previous transition
Mar  1 11:15:59 wapgw1-log corosync[24536]: [pcmk ] info: update_member: Node 1098666762/wapgw1-1 is now: lost
Mar  1 11:15:59 wapgw1-log corosync[24536]: [pcmk ] info: send_member_notification: Sending membership update 3500 to 2 children
Mar  1 11:15:59 wapgw1-log crmd: [24547]: notice: ais_dispatch: Membership 3500: quorum lost
Mar  1 11:15:59 wapgw1-log corosync[24536]: [TOTEM ] A processor joined or left the membership and a new membership was formed.
Mar  1 11:15:59 wapgw1-log crmd: [24547]: info: crm_update_peer: Node wapgw1-2: id=1115443978 state=lost (new) addr=r(0) ip(10.83.124.66) votes=1 born=3400 seen=3496 proc=00000000000000000000000000013312
Mar  1 11:15:59 wapgw1-log cib: [24543]: notice: ais_dispatch: Membership 3500: quorum lost
Mar  1 11:15:59 wapgw1-log crmd: [24547]: info: crm_update_peer: Node wapgw1-1: id=1098666762 state=lost (new) addr=r(0) ip(10.83.124.65) votes=1 born=3404 seen=3496 proc=00000000000000000000000000013312
Mar  1 11:15:59 wapgw1-log cib: [24543]: info: crm_update_peer: Node wapgw1-2: id=1115443978 state=lost (new) addr=r(0) ip(10.83.124.66) votes=1 born=3400 seen=3496 proc=00000000000000000000000000013312
Mar  1 11:15:59 wapgw1-log crmd: [24547]: WARN: check_dead_member: Our DC node (wapgw1-2) left the cluster
Mar  1 11:15:59 wapgw1-log crmd: [24547]: info: do_state_transition: State transition S_NOT_DC -> S_ELECTION [ input=I_ELECTION cause=C_FSA_INTERNAL origin=check_dead_member ]
Mar  1 11:15:59 wapgw1-log cib: [24543]: info: crm_update_peer: Node wapgw1-1: id=1098666762 state=lost (new) addr=r(0) ip(10.83.124.65) votes=1 born=3404 seen=3496 proc=00000000000000000000000000013312
Mar  1 11:15:59 wapgw1-log crmd: [24547]: info: update_dc: Unset DC wapgw1-2
Mar  1 11:15:59 wapgw1-log corosync[24536]: [MAIN ] Completed service synchronization, ready to provide service.
Mar  1 11:15:59 wapgw1-log crmd: [24547]: info: do_state_transition: State transition S_ELECTION -> S_INTEGRATION [ input=I_ELECTION_DC cause=C_FSA_INTERNAL origin=do_election_check ]
Mar  1 11:15:59 wapgw1-log crmd: [24547]: info: do_te_control: Registering TE UUID: 1be865f6-557d-45c4-b549-c10dbab5acc4
Mar  1 11:15:59 wapgw1-log crmd: [24547]: WARN: cib_client_add_notify_callback: Callback already present
Mar  1 11:15:59 wapgw1-log crmd: [24547]: info: set_graph_functions: Setting custom graph functions
Mar  1 11:15:59 wapgw1-log crmd: [24547]: info: unpack_graph: Unpacked transition -1: 0 actions in 0 synapses
Mar  1 11:15:59 wapgw1-log crmd: [24547]: info: do_dc_takeover: Taking over DC status for this partition
Mar  1 11:15:59 wapgw1-log cib: [24543]: info: cib_process_readwrite: We are now in R/W mode

The DC node noticed the member loss:

Mar  1 11:15:59 wapgw1-2 pengine: [5748]: WARN: pe_fence_node: Node wapgw1-log will be fenced because it is un-expectedly down
Mar  1 11:15:59 wapgw1-2 pengine: [5748]: info: determine_online_status_fencing: ha_state=active, ccm_state=false, crm_state=online, join_state=member, expected=member
Mar  1 11:15:59 wapgw1-2 pengine: [5748]: WARN: determine_online_status: Node wapgw1-log is unclean

But fencing was unavailable, so the sad story continues.

Bonding switched to the backup link; it took about a second. Back on the quorum node:

Mar  1 11:15:59 wapgw1-log kernel: bonding: bond0: link status definitely down for interface eth0, disabling it
Mar  1 11:15:59 wapgw1-log kernel: bonding: bond0: making interface eth1 the new active one.
Mar  1 11:15:59 wapgw1-log cib: [24543]: info: cib_process_request: Operation complete: op cib_modify for section crm_config (origin=local/crmd/169, version=0.797.24): ok (rc=0)
Mar  1 11:15:59 wapgw1-log crmd: [24547]: info: crm_ais_dispatch: Setting expected votes to 3
Mar  1 11:15:59 wapgw1-log crmd: [24547]: info: do_state_transition: State transition S_INTEGRATION -> S_FINALIZE_JOIN [ input=I_INTEGRATED cause=C_FSA_INTERNAL origin=check_join_state ]
Mar  1 11:15:59 wapgw1-log crmd: [24547]: info: do_state_transition: All 1 cluster nodes responded to the join offer.
Mar  1 11:15:59 wapgw1-log crmd: [24547]: info: do_dc_join_finalize: join-5: Syncing the CIB from wapgw1-log to the rest of the cluster
Mar  1 11:15:59 wapgw1-log crmd: [24547]: info: te_connect_stonith: Attempting connection to fencing daemon...
Mar  1 11:15:59 wapgw1-log cib: [24543]: info: cib_process_request: Operation complete: op cib_modify for section crm_config (origin=local/crmd/172, version=0.797.24): ok (rc=0)
Mar  1 11:15:59 wapgw1-log cib: [24543]: info: cib_process_request: Operation complete: op cib_sync for section 'all' (origin=local/crmd/173, version=0.797.24): ok (rc=0)
Mar  1 11:15:59 wapgw1-log corosync[24536]: [pcmk ] notice: pcmk_peer_update: Transitional membership event on ring 3504: memb=1, new=0, lost=0
Mar  1 11:15:59 wapgw1-log corosync[24536]: [pcmk ] info: pcmk_peer_update: memb: wapgw1-log 813454090
Mar  1 11:15:59 wapgw1-log corosync[24536]: [pcmk ] notice: pcmk_peer_update: Stable membership event on ring 3504: memb=3, new=2, lost=0
Mar  1 11:15:59 wapgw1-log corosync[24536]: [pcmk ] info: update_member: Node 1098666762/wapgw1-1 is now: member
Mar  1 11:15:59 wapgw1-log corosync[24536]: [pcmk ] info: pcmk_peer_update: NEW: wapgw1-1 1098666762
Mar  1 11:15:59 wapgw1-log corosync[24536]: [pcmk ] info: update_member: Node 1115443978/wapgw1-2 is now: member
Mar  1 11:15:59 wapgw1-log corosync[24536]: [pcmk ] info: pcmk_peer_update: NEW: wapgw1-2 1115443978
Mar  1 11:15:59 wapgw1-log corosync[24536]: [pcmk ] info: pcmk_peer_update: MEMB: wapgw1-log 813454090
Mar  1 11:15:59 wapgw1-log corosync[24536]: [pcmk ] info: pcmk_peer_update: MEMB: wapgw1-1 1098666762
Mar  1 11:15:59 wapgw1-log cib: [24543]: notice: ais_dispatch: Membership 3504: quorum acquired
Mar  1 11:15:59 wapgw1-log corosync[24536]: [pcmk ] info: pcmk_peer_update: MEMB: wapgw1-2 1115443978
Mar  1 11:15:59 wapgw1-log corosync[24536]: [pcmk ] info: send_member_notification: Sending membership update 3504 to 2 children
Mar  1 11:15:59 wapgw1-log corosync[24536]: [TOTEM ] A processor joined or left the membership and a new membership was formed.
Mar  1 11:15:59 wapgw1-log cib: [24543]: info: crm_update_peer: Node wapgw1-2: id=1115443978 state=member (new) addr=r(0) ip(10.83.124.66) votes=1 born=3400 seen=3504 proc=00000000000000000000000000013312
Mar  1 11:15:59 wapgw1-log cib: [24543]: info: crm_update_peer: Node wapgw1-1: id=1098666762 state=member (new) addr=r(0) ip(10.83.124.65) votes=1 born=3404 seen=3504 proc=00000000000000000000000000013312
Mar  1 11:15:59 wapgw1-log corosync[24536]: [MAIN ] Completed service synchronization, ready to provide service.
Mar  1 11:15:59 wapgw1-log cib: [24543]: info: cib_process_diff: Diff 0.797.25 -> 0.797.26 not applied to 0.797.24: current "num_updates" is less than required
Mar  1 11:15:59 wapgw1-log cib: [24543]: WARN: cib_server_process_diff: Not requesting full refresh in slave mode.
Mar  1 11:15:59 wapgw1-log cib: [24543]: info: cib_process_diff: Diff 0.797.26 -> 0.797.27 not applied to 0.797.24: current "num_updates" is less than required
Mar  1 11:15:59 wapgw1-log cib: [24543]: WARN: cib_server_process_diff: Not requesting full refresh in slave mode.
Mar  1 11:15:59 wapgw1-log cib: [24543]: info: cib_process_diff: Diff 0.797.27 -> 0.797.28 not applied to 0.797.24: current "num_updates" is less than required
Mar  1 11:15:59 wapgw1-log cib: [24543]: WARN: cib_server_process_diff: Not requesting full refresh in slave mode.
Mar  1 11:16:00 wapgw1-log crmd: [24547]: info: te_connect_stonith: Connected
Mar  1 11:16:00 wapgw1-log crmd: [24547]: notice: ais_dispatch: Membership 3504: quorum acquired
Mar  1 11:16:00 wapgw1-log crmd: [24547]: info: ais_status_callback: status: wapgw1-2 is now member (was lost)
Mar  1 11:16:00 wapgw1-log crmd: [24547]: info: crm_update_peer: Node wapgw1-2: id=1115443978 state=member (new) addr=r(0) ip(10.83.124.66) votes=1 born=3400 seen=3504 proc=00000000000000000000000000013312
Mar  1 11:16:00 wapgw1-log crmd: [24547]: info: ais_status_callback: status: wapgw1-1 is now member (was lost)
Mar  1 11:16:00 wapgw1-log crmd: [24547]: info: crm_update_peer: Node wapgw1-1: id=1098666762 state=member (new) addr=r(0) ip(10.83.124.65) votes=1 born=3404 seen=3504 proc=00000000000000000000000000013312
Mar  1 11:16:00 wapgw1-log crmd: [24547]: info: crm_update_quorum: Updating quorum status to true (call=181)
Mar  1 11:16:00 wapgw1-log cib: [24543]: info: cib_process_request: Operation complete: op cib_modify for section nodes (origin=local/crmd/174, version=0.797.24): ok (rc=0)
Mar  1 11:16:00 wapgw1-log cib: [24543]: info: cib_process_request: Operation complete: op cib_delete for section //node_state[@uname='wapgw1-2']/lrm (origin=local/crmd/175, version=0.797.25): ok (rc=0)
Mar  1 11:16:00 wapgw1-log cib: [24543]: info: cib_process_request: Operation complete: op cib_delete for section //node_state[@uname='wapgw1-2']/transient_attributes (origin=local/crmd/176, version=0.797.26): ok (rc=0)
Mar  1 11:16:00 wapgw1-log cib: [24543]: info: cib_process_request: Operation complete: op cib_delete for section //node_state[@uname='wapgw1-1']/lrm (origin=local/crmd/177, version=0.797.27): ok (rc=0)
Mar  1 11:16:00 wapgw1-log cib: [24543]: info: cib_process_request: Operation complete: op cib_delete for section //node_state[@uname='wapgw1-1']/transient_attributes (origin=local/crmd/178, version=0.797.28): ok (rc=0)
Mar  1 11:16:00 wapgw1-log cib: [24543]: info: cib_process_request: Operation complete: op cib_modify for section nodes (origin=local/crmd/179, version=0.797.28): ok (rc=0)
Mar  1 11:16:00 wapgw1-log cib: [24543]: info: log_data_element: cib:diff: - <cib dc-uuid="wapgw1-2" admin_epoch="0" epoch="797" num_updates="28" />
Mar  1 11:16:00 wapgw1-log cib: [24543]: info: log_data_element: cib:diff: + <cib dc-uuid="wapgw1-log" admin_epoch="0" epoch="798" num_updates="1" />
Mar  1 11:16:00 wapgw1-log cib: [24543]: info: cib_process_request: Operation complete: op cib_modify for section cib (origin=local/crmd/181, version=0.798.1): ok (rc=0)
Mar  1 11:16:00 wapgw1-log crmd: [24547]: info: crm_ais_dispatch: Setting expected votes to 3
Mar  1 11:16:00 wapgw1-log cib: [24721]: info: write_cib_contents: Archived previous version as /var/lib/heartbeat/crm/cib-57.raw
Mar  1 11:16:00 wapgw1-log crmd: [24547]: info: erase_xpath_callback: Deletion of "//node_state[@uname='wapgw1-2']/lrm": ok (rc=0)
Mar  1 11:16:00 wapgw1-log crmd: [24547]: info: erase_xpath_callback: Deletion of "//node_state[@uname='wapgw1-2']/transient_attributes": ok (rc=0)
Mar  1 11:16:00 wapgw1-log crmd: [24547]: info: erase_xpath_callback: Deletion of "//node_state[@uname='wapgw1-1']/lrm": ok (rc=0)
Mar  1 11:16:00 wapgw1-log crmd: [24547]: info: erase_xpath_callback: Deletion of "//node_state[@uname='wapgw1-1']/transient_attributes": ok (rc=0)
Mar  1 11:16:00 wapgw1-log cib: [24721]: info: write_cib_contents: Wrote version 0.798.0 of the CIB to disk (digest: 2cb8c5e90f8f3d227335340ddeb1914a)

Meanwhile, on the DC node:

Mar  1 11:15:59 wapgw1-2 corosync[5738]: [pcmk ] notice: pcmk_peer_update: Transitional membership event on ring 3504: memb=2, new=0, lost=0
Mar  1 11:15:59 wapgw1-2 corosync[5738]: [pcmk ] info: pcmk_peer_update: memb: wapgw1-1 1098666762
Mar  1 11:15:59 wapgw1-2 corosync[5738]: [pcmk ] info: pcmk_peer_update: memb: wapgw1-2 1115443978
Mar  1 11:15:59 wapgw1-2 corosync[5738]: [pcmk ] notice: pcmk_peer_update: Stable membership event on ring 3504: memb=3, new=1, lost=0
Mar  1 11:15:59 wapgw1-2 corosync[5738]: [pcmk ] info: update_member: Node 813454090/wapgw1-log is now: member
Mar  1 11:15:59 wapgw1-2 corosync[5738]: [pcmk ] info: pcmk_peer_update: NEW: wapgw1-log 813454090
Mar  1 11:15:59 wapgw1-2 corosync[5738]: [pcmk ] info: pcmk_peer_update: MEMB: wapgw1-log 813454090
Mar  1 11:15:59 wapgw1-2 corosync[5738]: [pcmk ] info: pcmk_peer_update: MEMB: wapgw1-1 1098666762
Mar  1 11:15:59 wapgw1-2 corosync[5738]: [pcmk ] info: pcmk_peer_update: MEMB: wapgw1-2 1115443978
Mar  1 11:15:59 wapgw1-2 corosync[5738]: [pcmk ] info: send_member_notification: Sending membership update 3504 to 2 children
Mar  1 11:15:59 wapgw1-2 corosync[5738]: [TOTEM ] A processor joined or left the membership and a new membership was formed.
Mar  1 11:15:59 wapgw1-2 cib: [5745]: info: ais_dispatch: Membership 3504: quorum retained
Mar  1 11:15:59 wapgw1-2 crmd: [5749]: info: ais_dispatch: Membership 3504: quorum retained
Mar  1 11:15:59 wapgw1-2 cib: [5745]: info: crm_update_peer: Node wapgw1-log: id=813454090 state=member (new) addr=r(0) ip(10.83.124.48) votes=1 born=3464 seen=3504 proc=00000000000000000000000000013312
Mar  1 11:15:59 wapgw1-2 crmd: [5749]: info: ais_status_callback: status: wapgw1-log is now member (was lost)
Mar  1 11:15:59 wapgw1-2 crmd: [5749]: info: crm_update_peer: Node wapgw1-log: id=813454090 state=member (new) addr=r(0) ip(10.83.124.48) votes=1 born=3464 seen=3504 proc=00000000000000000000000000013312
Mar  1 11:15:59 wapgw1-2 corosync[5738]: [MAIN ] Completed service synchronization, ready to provide service.

This time, the DC decided to re-probe all resources on the rejoined node.

Mar  1 11:16:00 wapgw1-2 crmd: [5749]: info: do_state_transition: State transition S_POLICY_ENGINE -> S_TRANSITION_ENGINE [ input=I_PE_SUCCESS cause=C_IPC_MESSAGE origin=handle_response ]
Mar  1 11:16:00 wapgw1-2 crmd: [5749]: info: unpack_graph: Unpacked transition 1353: 37 actions in 37 synapses
Mar  1 11:16:00 wapgw1-2 crmd: [5749]: info: do_te_invoke: Processing graph 1353 (ref=pe_calc-dc-1298967360-1592) derived from /var/lib/pengine/pe-input-1525.bz2
Mar  1 11:16:00 wapgw1-2 crmd: [5749]: info: te_rsc_command: Initiating action 30: monitor ilo-wapgw1-1:0_monitor_0 on wapgw1-log
Mar  1 11:16:00 wapgw1-2 crmd: [5749]: info: te_rsc_command: Initiating action 31: monitor ilo-wapgw1-2:0_monitor_0 on wapgw1-log
Mar  1 11:16:00 wapgw1-2 crmd: [5749]: info: te_rsc_command: Initiating action 32: monitor ilo-wapgw1-log:0_monitor_0 on wapgw1-log
Mar  1 11:16:00 wapgw1-2 crmd: [5749]: info: te_rsc_command: Initiating action 33: monitor p-drbd-mdirect1-1:0_monitor_0 on wapgw1-log
Mar  1 11:16:00 wapgw1-2 crmd: [5749]: info: te_rsc_command: Initiating action 34: monitor p-drbd-mdirect1-2:0_monitor_0 on wapgw1-log

And then... 'Suddenly' another DC appeared.

Mar  1 11:16:00 wapgw1-2 crmd: [5749]: ERROR: crmd_ha_msg_filter: Another DC detected: wapgw1-log (op=noop)
Mar  1 11:16:00 wapgw1-2 crmd: [5749]: info: do_state_transition: State transition S_TRANSITION_ENGINE -> S_ELECTION [ input=I_ELECTION cause=C_FSA_INTERNAL origin=crmd_ha_msg_filter ]
Mar  1 11:16:00 wapgw1-2 crmd: [5749]: info: update_dc: Unset DC wapgw1-2

Mar  1 11:16:00 wapgw1-log crmd: [24547]: ERROR: crmd_ha_msg_filter: Another DC detected: wapgw1-2 (op=noop)
Mar  1 11:16:00 wapgw1-log crmd: [24547]: info: do_state_transition: State transition S_FINALIZE_JOIN -> S_ELECTION [ input=I_ELECTION cause=C_FSA_INTERNAL origin=crmd_ha_msg_filter ]
Mar  1 11:16:00 wapgw1-log crmd: [24547]: info: update_dc: Unset DC wapgw1-log

All the monitor actions failed like this:

Mar  1 11:16:00 wapgw1-log crmd: [24547]: info: do_lrm_rsc_op: Performing key=33:1353:7:22dc5497-478f-49ff-b07f-9fcd6da325cd op=p-drbd-mdirect1-1:0_monitor_0 )
Mar  1 11:16:00 wapgw1-log crmd: [24547]: info: do_lrm_rsc_op: Discarding attempt to perform action monitor on p-drbd-mdirect1-1:0 in state S_ELECTION
Mar  1 11:16:00 wapgw1-log crmd: [24547]: info: send_direct_ack: ACK'ing resource op p-drbd-mdirect1-1:0_monitor_0 from 33:1353:7:22dc5497-478f-49ff-b07f-9fcd6da325cd: lrm_invoke-lrmd-1298967360-58
Mar  1 11:16:00 wapgw1-log crmd: [24547]: info: process_te_message: Processing (N)ACK lrm_invoke-lrmd-1298967360-58 from wapgw1-log
Mar  1 11:16:00 wapgw1-log crmd: [24547]: info: process_graph_event: Action p-drbd-mdirect1-1:0_monitor_0/33 (4:99;33:1353:7:22dc5497-478f-49ff-b07f-9fcd6da325cd) initiated by a different transitioner
Mar  1 11:16:00 wapgw1-log crmd: [24547]: info: abort_transition_graph: process_graph_event:456 - Triggered transition abort (complete=1, tag=lrm_rsc_op, id=p-drbd-mdirect1-1:0_monitor_0, magic=4:99;33:1353:7:22dc5497-478f-49ff-b07f-9fcd6da325cd) : Foreign event

The old DC won the election.

Mar 1 11:16:00 wapgw1-2 crmd: [5749]: info: do_dc_takeover: Taking over DC status for this partition

Mar  1 11:16:00 wapgw1-log crmd: [24547]: info: do_state_transition: State transition S_ELECTION -> S_RELEASE_DC [ input=I_RELEASE_DC cause=C_FSA_INTERNAL origin=do_election_count_vote ]
Mar  1 11:16:00 wapgw1-log crmd: [24547]: info: do_dc_release: DC role released
Mar  1 11:16:00 wapgw1-log cib: [24543]: info: cib_process_readwrite: We are now in R/O mode
Mar  1 11:16:00 wapgw1-log crmd: [24547]: info: pe_connection_destroy: Connection to the Policy Engine released
Mar  1 11:16:00 wapgw1-log crmd: [24547]: info: do_te_control: Transitioner is now inactive
Mar  1 11:16:00 wapgw1-log crmd: [24547]: info: do_te_control: Disconnecting STONITH...
Mar  1 11:16:00 wapgw1-log crmd: [24547]: info: tengine_stonith_connection_destroy: Fencing daemon disconnected
Mar  1 11:16:00 wapgw1-log crmd: [24547]: notice: Not currently connected.
Mar  1 11:16:00 wapgw1-log crmd: [24547]: info: do_state_transition: State transition S_RELEASE_DC -> S_PENDING [ input=I_RELEASE_SUCCESS cause=C_FSA_INTERNAL origin=do_dc_release ]
Mar  1 11:16:01 wapgw1-log attrd: [24545]: WARN: attrd_cib_callback: Update 2849 for probe_complete=true failed: The object/attribute does not exist
Mar  1 11:16:01 wapgw1-log attrd: [24545]: WARN: attrd_cib_callback: Update 2919 for last-failure-ilo-wapgw1-2:0=1298967173 failed: The object/attribute does not exist
Mar  1 11:16:01 wapgw1-log attrd: [24545]: WARN: attrd_cib_callback: Update 2924 for last-failure-ilo-wapgw1-1:0=1298967173 failed: The object/attribute does not exist
Mar  1 11:16:01 wapgw1-log attrd: [24545]: WARN: attrd_cib_callback: Update 2944 for fail-count-ilo-wapgw1-2:0=INFINITY failed: The object/attribute does not exist
Mar  1 11:16:01 wapgw1-log attrd: [24545]: WARN: attrd_cib_callback: Update 2947 for fail-count-ilo-wapgw1-1:0=INFINITY failed: The object/attribute does not exist
Mar  1 11:16:01 wapgw1-log crmd: [24547]: info: update_dc: Set DC to wapgw1-2 (3.0.1)

After a while, the DC remembered all those actions:

Mar  1 11:17:20 wapgw1-2 crmd: [5749]: WARN: action_timer_callback: Timer popped (timeout=20000, abort_level=1000000, complete=false)
Mar  1 11:17:20 wapgw1-2 crmd: [5749]: ERROR: print_elem: Aborting transition, action lost: [Action 33]: In-flight (id: p-drbd-mdirect1-1:0_monitor_0, loc: wapgw1-log, priority: 0)
Mar  1 11:17:20 wapgw1-2 crmd: [5749]: info: abort_transition_graph: action_timer_callback:486 - Triggered transition abort (complete=0) : Action lost
Mar  1 11:17:20 wapgw1-2 crmd: [5749]: WARN: cib_action_update: rsc_op 33: p-drbd-mdirect1-1:0_monitor_0 on wapgw1-log timed out
...
Mar  1 11:17:21 wapgw1-2 pengine: [5748]: WARN: unpack_rsc_op: Processing failed op ilo-wapgw1-log:0_monitor_0 on wapgw1-log: unknown error (1)
Mar  1 11:17:21 wapgw1-2 pengine: [5748]: WARN: unpack_rsc_op: Processing failed op p-drbd-mproxy1-2:0_monitor_0 on wapgw1-log: unknown error (1)
Mar  1 11:17:21 wapgw1-2 pengine: [5748]: WARN: unpack_rsc_op: Processing failed op p-drbd-mdirect1-1:0_monitor_0 on wapgw1-log: unknown error (1)
Mar  1 11:17:21 wapgw1-2 pengine: [5748]: WARN: unpack_rsc_op: Processing failed op p-drbd-mrouter1-1:0_monitor_0 on wapgw1-log: unknown error (1)
Mar  1 11:17:21 wapgw1-2 pengine: [5748]: WARN: unpack_rsc_op: Processing failed op vm-mdirect1-1_monitor_0 on wapgw1-log: unknown error (1)
Mar  1 11:17:21 wapgw1-2 pengine: [5748]: WARN: unpack_rsc_op: Processing failed op vm-mproxy1-1_monitor_0 on wapgw1-log: unknown error (1)
Mar  1 11:17:21 wapgw1-2 pengine: [5748]: WARN: unpack_rsc_op: Processing failed op vm-mdirect1-2_monitor_0 on wapgw1-log: unknown error (1)
...
Mar  1 11:17:21 wapgw1-2 pengine: [5748]: ERROR: native_add_running: Resource ocf::VirtualDomain:vm-mdirect1-1 appears to be active on 2 nodes.
Mar  1 11:17:21 wapgw1-2 pengine: [5748]: WARN: See http://clusterlabs.org/wiki/FAQ#Resource_is_Too_Active for more information.
Mar  1 11:17:21 wapgw1-2 pengine: [5748]: ERROR: native_add_running: Resource ocf::VirtualDomain:vm-mproxy1-1 appears to be active on 2 nodes.
Mar  1 11:17:21 wapgw1-2 pengine: [5748]: WARN: See http://clusterlabs.org/wiki/FAQ#Resource_is_Too_Active for more information.
Mar  1 11:17:21 wapgw1-2 pengine: [5748]: ERROR: native_add_running: Resource ocf::VirtualDomain:vm-mrouter1-1 appears to be active on 2 nodes.
Mar  1 11:17:21 wapgw1-2 pengine: [5748]: WARN: See http://clusterlabs.org/wiki/FAQ#Resource_is_Too_Active for more information.
...

So it tried to stop all the resources. It succeeded on the real nodes, but not on the quorum node:

Mar  1 11:17:21 wapgw1-log VirtualDomain[24851]: ERROR: Setup problem: Couldn't find utility virsh
Mar  1 11:17:21 wapgw1-log VirtualDomain[24852]: ERROR: Setup problem: Couldn't find utility virsh
Mar  1 11:17:21 wapgw1-log VirtualDomain[24853]: ERROR: Setup problem: Couldn't find utility virsh
Mar  1 11:17:21 wapgw1-log VirtualDomain[24850]: ERROR: Setup problem: Couldn't find utility virsh
Mar  1 11:17:21 wapgw1-log crmd: [24547]: info: process_lrm_event: LRM operation vm-mdirect1-2_stop_0 (call=44, rc=5, cib-update=188, confirmed=true) not installed
Mar  1 11:17:21 wapgw1-log crmd: [24547]: info: process_lrm_event: LRM operation vm-mproxy1-1_stop_0 (call=45, rc=5, cib-update=189, confirmed=true) not installed
Mar  1 11:17:21 wapgw1-log crmd: [24547]: info: process_lrm_event: LRM operation vm-mproxy1-2_stop_0 (call=46, rc=5, cib-update=190, confirmed=true) not installed
Mar  1 11:17:21 wapgw1-log crmd: [24547]: info: process_lrm_event: LRM operation vm-mdirect1-1_stop_0 (call=43, rc=5, cib-update=191, confirmed=true) not installed

At this point the cluster was stuck. The resources could not be stopped on the quorum node, because no such resources exist there. All resources on the real nodes were down. The policy engine believed the resources were started on the quorum node.

Mar  1 11:17:50 wapgw1-2 pengine: [5748]: notice: native_print: vm-mdirect1-1 (ocf::heartbeat:VirtualDomain): Started wapgw1-log FAILED
Mar  1 11:17:50 wapgw1-2 pengine: [5748]: notice: native_print: vm-mdirect1-2 (ocf::heartbeat:VirtualDomain): Started wapgw1-log
Mar  1 11:17:50 wapgw1-2 pengine: [5748]: notice: native_print: vm-mproxy1-1 (ocf::heartbeat:VirtualDomain): Started wapgw1-log
Mar  1 11:17:50 wapgw1-2 pengine: [5748]: notice: native_print: vm-mproxy1-2 (ocf::heartbeat:VirtualDomain): Started wapgw1-log

End of story. I have a full hb_report, so let me know if it would help.

I understand that the quorum node could have been fenced to resolve this situation. But fencing by reset is not a good thing, particularly for the non-clustered services running on that node. Moreover, fencing was not really needed, as these resources can never run on that node. The cluster would have continued fine if the quorum node could simply confirm that no resources were running on it.

Was there something Pacemaker could have done better? For example:

1) Retry the failed monitor actions somehow;
2) Maybe accept "not installed" on stop as a confirmation. I think that would be dangerous, though, if the node has failed in some screwed-up way;
3) Or could there be a configuration knob saying "this resource does not exist on that node", so the cluster doesn't try to manage nonexistent resources where that is impossible? (A sketch of the closest thing I can think of follows below.)
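
For option 3, the closest thing I can see is an opt-in cluster (a crm shell sketch; the constraint IDs and scores are illustrative, not from our CIB):

# opt-in cluster: resources may run only where a positive score allows them
property symmetric-cluster=false
# nothing here permits wapgw1-log, so the VM is confined to the real nodes
location loc-vm-mdirect1-1-on-1 vm-mdirect1-1 100: wapgw1-1
location loc-vm-mdirect1-1-on-2 vm-mdirect1-1 100: wapgw1-2

As far as I understand, though, the initial monitor_0 probes would still be scheduled on every online node, so this alone would not have prevented the failed probes above.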

What is the right way to avoid a cluster split when bonding fails over? I have adjusted the 'token' directive in corosync.conf to 5000; is this OK?
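
For reference, the relevant part of corosync.conf now reads like this (only the token value was changed from the default 1000 ms):

totem {
        version: 2
        # wait up to 5 s for the token before declaring a processor failed;
        # this should ride out the ~1 s active-backup failover
        token: 5000
}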

Thank you for your time.

--
Pavel Levshin //flicker

