Hi everyone.

We have a three-node cluster running Pacemaker 1.0.10 on RHEL 5.5. Two nodes (wapgw1-1 and wapgw1-2) are configured for DRBD and run virtual machines on top of it. The third node (wapgw1-log) is mostly a quorum server, i.e. it has neither libvirtd nor DRBD installed. Location constraints allow resources to run on the "real" nodes only.
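
For illustration, the constraints are of roughly this shape (a crm shell sketch from memory; the constraint IDs and the -inf score are illustrative, not a verbatim copy of our CIB):

# forbid a VM resource on the quorum node entirely (illustrative ID/score)
location loc-vm-mdirect1-1-not-on-log vm-mdirect1-1 -inf: wapgw1-log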

All three nodes are connected to the network over bonded links in active-backup mode.
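
On RHEL 5 the bonding is configured roughly like this (a sketch; the miimon value is an assumption, and IP details are omitted):

# /etc/modprobe.conf
alias bond0 bonding
# miimon=100 (link polling every 100 ms) is an assumed value
options bond0 mode=active-backup miimon=100

# /etc/sysconfig/network-scripts/ifcfg-eth0 (ifcfg-eth1 is analogous)
DEVICE=eth0
MASTER=bond0
SLAVE=yes
ONBOOT=yes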

STONITH had been configured but was unavailable at the time. That's bad, I know.

The problem came when one of the two interfaces on the quorum node (wapgw1-log) went down. It was not the first time, and previously this had never caused any harm.

Corosync lost connectivity, and the cluster fell apart into partitions.

Mar  1 11:15:58 wapgw1-log corosync[24536]: [TOTEM ] A processor failed, forming new configuration.
Mar  1 11:15:59 wapgw1-log corosync[24536]: [pcmk ] notice: pcmk_peer_update: Transitional membership event on ring 3500: memb=1, new=0, lost=2
Mar  1 11:15:59 wapgw1-log corosync[24536]: [pcmk ] info: pcmk_peer_update: memb: wapgw1-log 813454090
Mar  1 11:15:59 wapgw1-log corosync[24536]: [pcmk ] info: pcmk_peer_update: lost: wapgw1-1 1098666762
Mar  1 11:15:59 wapgw1-log corosync[24536]: [pcmk ] info: pcmk_peer_update: lost: wapgw1-2 1115443978
Mar  1 11:15:59 wapgw1-log corosync[24536]: [pcmk ] notice: pcmk_peer_update: Stable membership event on ring 3500: memb=1, new=0, lost=0
Mar  1 11:15:59 wapgw1-log corosync[24536]: [pcmk ] info: pcmk_peer_update: MEMB: wapgw1-log 813454090
Mar  1 11:15:59 wapgw1-log corosync[24536]: [pcmk ] info: ais_mark_unseen_peer_dead: Node wapgw1-2 was not seen in the previous transition
Mar  1 11:15:59 wapgw1-log corosync[24536]: [pcmk ] info: update_member: Node 1115443978/wapgw1-2 is now: lost
Mar  1 11:15:59 wapgw1-log corosync[24536]: [pcmk ] info: ais_mark_unseen_peer_dead: Node wapgw1-1 was not seen in the previous transition
Mar  1 11:15:59 wapgw1-log corosync[24536]: [pcmk ] info: update_member: Node 1098666762/wapgw1-1 is now: lost
Mar  1 11:15:59 wapgw1-log corosync[24536]: [pcmk ] info: send_member_notification: Sending membership update 3500 to 2 children
Mar  1 11:15:59 wapgw1-log crmd: [24547]: notice: ais_dispatch: Membership 3500: quorum lost
Mar  1 11:15:59 wapgw1-log corosync[24536]: [TOTEM ] A processor joined or left the membership and a new membership was formed.
Mar  1 11:15:59 wapgw1-log crmd: [24547]: info: crm_update_peer: Node wapgw1-2: id=1115443978 state=lost (new) addr=r(0) ip(10.83.124.66) votes=1 born=3400 seen=3496 proc=00000000000000000000000000013312
Mar  1 11:15:59 wapgw1-log cib: [24543]: notice: ais_dispatch: Membership 3500: quorum lost
Mar  1 11:15:59 wapgw1-log crmd: [24547]: info: crm_update_peer: Node wapgw1-1: id=1098666762 state=lost (new) addr=r(0) ip(10.83.124.65) votes=1 born=3404 seen=3496 proc=00000000000000000000000000013312
Mar  1 11:15:59 wapgw1-log cib: [24543]: info: crm_update_peer: Node wapgw1-2: id=1115443978 state=lost (new) addr=r(0) ip(10.83.124.66) votes=1 born=3400 seen=3496 proc=00000000000000000000000000013312
Mar  1 11:15:59 wapgw1-log crmd: [24547]: WARN: check_dead_member: Our DC node (wapgw1-2) left the cluster
Mar  1 11:15:59 wapgw1-log crmd: [24547]: info: do_state_transition: State transition S_NOT_DC -> S_ELECTION [ input=I_ELECTION cause=C_FSA_INTERNAL origin=check_dead_member ]
Mar  1 11:15:59 wapgw1-log cib: [24543]: info: crm_update_peer: Node wapgw1-1: id=1098666762 state=lost (new) addr=r(0) ip(10.83.124.65) votes=1 born=3404 seen=3496 proc=00000000000000000000000000013312
Mar  1 11:15:59 wapgw1-log crmd: [24547]: info: update_dc: Unset DC wapgw1-2
Mar  1 11:15:59 wapgw1-log corosync[24536]: [MAIN ] Completed service synchronization, ready to provide service.
Mar  1 11:15:59 wapgw1-log crmd: [24547]: info: do_state_transition: State transition S_ELECTION -> S_INTEGRATION [ input=I_ELECTION_DC cause=C_FSA_INTERNAL origin=do_election_check ]
Mar  1 11:15:59 wapgw1-log crmd: [24547]: info: do_te_control: Registering TE UUID: 1be865f6-557d-45c4-b549-c10dbab5acc4
Mar  1 11:15:59 wapgw1-log crmd: [24547]: WARN: cib_client_add_notify_callback: Callback already present
Mar  1 11:15:59 wapgw1-log crmd: [24547]: info: set_graph_functions: Setting custom graph functions
Mar  1 11:15:59 wapgw1-log crmd: [24547]: info: unpack_graph: Unpacked transition -1: 0 actions in 0 synapses
Mar  1 11:15:59 wapgw1-log crmd: [24547]: info: do_dc_takeover: Taking over DC status for this partition
Mar  1 11:15:59 wapgw1-log cib: [24543]: info: cib_process_readwrite: We are now in R/W mode

The DC node noticed the member loss:

Mar  1 11:15:59 wapgw1-2 pengine: [5748]: WARN: pe_fence_node: Node wapgw1-log will be fenced because it is un-expectedly down
Mar  1 11:15:59 wapgw1-2 pengine: [5748]: info: determine_online_status_fencing: ha_state=active, ccm_state=false, crm_state=online, join_state=member, expected=member
Mar  1 11:15:59 wapgw1-2 pengine: [5748]: WARN: determine_online_status: Node wapgw1-log is unclean

But fencing was unavailable, so the sad story continues.

Bonding switched to the backup link; it took about a second. Back on the quorum node:

Mar  1 11:15:59 wapgw1-log kernel: bonding: bond0: link status definitely down for interface eth0, disabling it
Mar  1 11:15:59 wapgw1-log kernel: bonding: bond0: making interface eth1 the new active one.
Mar  1 11:15:59 wapgw1-log cib: [24543]: info: cib_process_request: Operation complete: op cib_modify for section crm_config (origin=local/crmd/169, version=0.797.24): ok (rc=0)
Mar  1 11:15:59 wapgw1-log crmd: [24547]: info: crm_ais_dispatch: Setting expected votes to 3
Mar  1 11:15:59 wapgw1-log crmd: [24547]: info: do_state_transition: State transition S_INTEGRATION -> S_FINALIZE_JOIN [ input=I_INTEGRATED cause=C_FSA_INTERNAL origin=check_join_state ]
Mar  1 11:15:59 wapgw1-log crmd: [24547]: info: do_state_transition: All 1 cluster nodes responded to the join offer.
Mar  1 11:15:59 wapgw1-log crmd: [24547]: info: do_dc_join_finalize: join-5: Syncing the CIB from wapgw1-log to the rest of the cluster
Mar  1 11:15:59 wapgw1-log crmd: [24547]: info: te_connect_stonith: Attempting connection to fencing daemon...
Mar  1 11:15:59 wapgw1-log cib: [24543]: info: cib_process_request: Operation complete: op cib_modify for section crm_config (origin=local/crmd/172, version=0.797.24): ok (rc=0)
Mar  1 11:15:59 wapgw1-log cib: [24543]: info: cib_process_request: Operation complete: op cib_sync for section 'all' (origin=local/crmd/173, version=0.797.24): ok (rc=0)
Mar  1 11:15:59 wapgw1-log corosync[24536]: [pcmk ] notice: pcmk_peer_update: Transitional membership event on ring 3504: memb=1, new=0, lost=0
Mar  1 11:15:59 wapgw1-log corosync[24536]: [pcmk ] info: pcmk_peer_update: memb: wapgw1-log 813454090
Mar  1 11:15:59 wapgw1-log corosync[24536]: [pcmk ] notice: pcmk_peer_update: Stable membership event on ring 3504: memb=3, new=2, lost=0
Mar  1 11:15:59 wapgw1-log corosync[24536]: [pcmk ] info: update_member: Node 1098666762/wapgw1-1 is now: member
Mar  1 11:15:59 wapgw1-log corosync[24536]: [pcmk ] info: pcmk_peer_update: NEW: wapgw1-1 1098666762
Mar  1 11:15:59 wapgw1-log corosync[24536]: [pcmk ] info: update_member: Node 1115443978/wapgw1-2 is now: member
Mar  1 11:15:59 wapgw1-log corosync[24536]: [pcmk ] info: pcmk_peer_update: NEW: wapgw1-2 1115443978
Mar  1 11:15:59 wapgw1-log corosync[24536]: [pcmk ] info: pcmk_peer_update: MEMB: wapgw1-log 813454090
Mar  1 11:15:59 wapgw1-log corosync[24536]: [pcmk ] info: pcmk_peer_update: MEMB: wapgw1-1 1098666762
Mar  1 11:15:59 wapgw1-log cib: [24543]: notice: ais_dispatch: Membership 3504: quorum acquired
Mar  1 11:15:59 wapgw1-log corosync[24536]: [pcmk ] info: pcmk_peer_update: MEMB: wapgw1-2 1115443978
Mar  1 11:15:59 wapgw1-log corosync[24536]: [pcmk ] info: send_member_notification: Sending membership update 3504 to 2 children
Mar  1 11:15:59 wapgw1-log corosync[24536]: [TOTEM ] A processor joined or left the membership and a new membership was formed.
Mar  1 11:15:59 wapgw1-log cib: [24543]: info: crm_update_peer: Node wapgw1-2: id=1115443978 state=member (new) addr=r(0) ip(10.83.124.66) votes=1 born=3400 seen=3504 proc=00000000000000000000000000013312
Mar  1 11:15:59 wapgw1-log cib: [24543]: info: crm_update_peer: Node wapgw1-1: id=1098666762 state=member (new) addr=r(0) ip(10.83.124.65) votes=1 born=3404 seen=3504 proc=00000000000000000000000000013312
Mar  1 11:15:59 wapgw1-log corosync[24536]: [MAIN ] Completed service synchronization, ready to provide service.
Mar  1 11:15:59 wapgw1-log cib: [24543]: info: cib_process_diff: Diff 0.797.25 -> 0.797.26 not applied to 0.797.24: current "num_updates" is less than required
Mar  1 11:15:59 wapgw1-log cib: [24543]: WARN: cib_server_process_diff: Not requesting full refresh in slave mode.
Mar  1 11:15:59 wapgw1-log cib: [24543]: info: cib_process_diff: Diff 0.797.26 -> 0.797.27 not applied to 0.797.24: current "num_updates" is less than required
Mar  1 11:15:59 wapgw1-log cib: [24543]: WARN: cib_server_process_diff: Not requesting full refresh in slave mode.
Mar  1 11:15:59 wapgw1-log cib: [24543]: info: cib_process_diff: Diff 0.797.27 -> 0.797.28 not applied to 0.797.24: current "num_updates" is less than required
Mar  1 11:15:59 wapgw1-log cib: [24543]: WARN: cib_server_process_diff: Not requesting full refresh in slave mode.
Mar  1 11:16:00 wapgw1-log crmd: [24547]: info: te_connect_stonith: Connected
Mar  1 11:16:00 wapgw1-log crmd: [24547]: notice: ais_dispatch: Membership 3504: quorum acquired
Mar  1 11:16:00 wapgw1-log crmd: [24547]: info: ais_status_callback: status: wapgw1-2 is now member (was lost)
Mar  1 11:16:00 wapgw1-log crmd: [24547]: info: crm_update_peer: Node wapgw1-2: id=1115443978 state=member (new) addr=r(0) ip(10.83.124.66) votes=1 born=3400 seen=3504 proc=00000000000000000000000000013312
Mar  1 11:16:00 wapgw1-log crmd: [24547]: info: ais_status_callback: status: wapgw1-1 is now member (was lost)
Mar  1 11:16:00 wapgw1-log crmd: [24547]: info: crm_update_peer: Node wapgw1-1: id=1098666762 state=member (new) addr=r(0) ip(10.83.124.65) votes=1 born=3404 seen=3504 proc=00000000000000000000000000013312
Mar  1 11:16:00 wapgw1-log crmd: [24547]: info: crm_update_quorum: Updating quorum status to true (call=181)
Mar  1 11:16:00 wapgw1-log cib: [24543]: info: cib_process_request: Operation complete: op cib_modify for section nodes (origin=local/crmd/174, version=0.797.24): ok (rc=0)
Mar  1 11:16:00 wapgw1-log cib: [24543]: info: cib_process_request: Operation complete: op cib_delete for section //node_state[@uname='wapgw1-2']/lrm (origin=local/crmd/175, version=0.797.25): ok (rc=0)
Mar  1 11:16:00 wapgw1-log cib: [24543]: info: cib_process_request: Operation complete: op cib_delete for section //node_state[@uname='wapgw1-2']/transient_attributes (origin=local/crmd/176, version=0.797.26): ok (rc=0)
Mar  1 11:16:00 wapgw1-log cib: [24543]: info: cib_process_request: Operation complete: op cib_delete for section //node_state[@uname='wapgw1-1']/lrm (origin=local/crmd/177, version=0.797.27): ok (rc=0)
Mar  1 11:16:00 wapgw1-log cib: [24543]: info: cib_process_request: Operation complete: op cib_delete for section //node_state[@uname='wapgw1-1']/transient_attributes (origin=local/crmd/178, version=0.797.28): ok (rc=0)
Mar  1 11:16:00 wapgw1-log cib: [24543]: info: cib_process_request: Operation complete: op cib_modify for section nodes (origin=local/crmd/179, version=0.797.28): ok (rc=0)
Mar  1 11:16:00 wapgw1-log cib: [24543]: info: log_data_element: cib:diff: - <cib dc-uuid="wapgw1-2" admin_epoch="0" epoch="797" num_updates="28" />
Mar  1 11:16:00 wapgw1-log cib: [24543]: info: log_data_element: cib:diff: + <cib dc-uuid="wapgw1-log" admin_epoch="0" epoch="798" num_updates="1" />
Mar  1 11:16:00 wapgw1-log cib: [24543]: info: cib_process_request: Operation complete: op cib_modify for section cib (origin=local/crmd/181, version=0.798.1): ok (rc=0)
Mar  1 11:16:00 wapgw1-log crmd: [24547]: info: crm_ais_dispatch: Setting expected votes to 3
Mar  1 11:16:00 wapgw1-log cib: [24721]: info: write_cib_contents: Archived previous version as /var/lib/heartbeat/crm/cib-57.raw
Mar  1 11:16:00 wapgw1-log crmd: [24547]: info: erase_xpath_callback: Deletion of "//node_state[@uname='wapgw1-2']/lrm": ok (rc=0)
Mar  1 11:16:00 wapgw1-log crmd: [24547]: info: erase_xpath_callback: Deletion of "//node_state[@uname='wapgw1-2']/transient_attributes": ok (rc=0)
Mar  1 11:16:00 wapgw1-log crmd: [24547]: info: erase_xpath_callback: Deletion of "//node_state[@uname='wapgw1-1']/lrm": ok (rc=0)
Mar  1 11:16:00 wapgw1-log crmd: [24547]: info: erase_xpath_callback: Deletion of "//node_state[@uname='wapgw1-1']/transient_attributes": ok (rc=0)
Mar  1 11:16:00 wapgw1-log cib: [24721]: info: write_cib_contents: Wrote version 0.798.0 of the CIB to disk (digest: 2cb8c5e90f8f3d227335340ddeb1914a)

Meanwhile, on the DC node:

Mar  1 11:15:59 wapgw1-2 corosync[5738]: [pcmk ] notice: pcmk_peer_update: Transitional membership event on ring 3504: memb=2, new=0, lost=0
Mar  1 11:15:59 wapgw1-2 corosync[5738]: [pcmk ] info: pcmk_peer_update: memb: wapgw1-1 1098666762
Mar  1 11:15:59 wapgw1-2 corosync[5738]: [pcmk ] info: pcmk_peer_update: memb: wapgw1-2 1115443978
Mar  1 11:15:59 wapgw1-2 corosync[5738]: [pcmk ] notice: pcmk_peer_update: Stable membership event on ring 3504: memb=3, new=1, lost=0
Mar  1 11:15:59 wapgw1-2 corosync[5738]: [pcmk ] info: update_member: Node 813454090/wapgw1-log is now: member
Mar  1 11:15:59 wapgw1-2 corosync[5738]: [pcmk ] info: pcmk_peer_update: NEW: wapgw1-log 813454090
Mar  1 11:15:59 wapgw1-2 corosync[5738]: [pcmk ] info: pcmk_peer_update: MEMB: wapgw1-log 813454090
Mar  1 11:15:59 wapgw1-2 corosync[5738]: [pcmk ] info: pcmk_peer_update: MEMB: wapgw1-1 1098666762
Mar  1 11:15:59 wapgw1-2 corosync[5738]: [pcmk ] info: pcmk_peer_update: MEMB: wapgw1-2 1115443978
Mar  1 11:15:59 wapgw1-2 corosync[5738]: [pcmk ] info: send_member_notification: Sending membership update 3504 to 2 children
Mar  1 11:15:59 wapgw1-2 corosync[5738]: [TOTEM ] A processor joined or left the membership and a new membership was formed.
Mar  1 11:15:59 wapgw1-2 cib: [5745]: info: ais_dispatch: Membership 3504: quorum retained
Mar  1 11:15:59 wapgw1-2 crmd: [5749]: info: ais_dispatch: Membership 3504: quorum retained
Mar  1 11:15:59 wapgw1-2 cib: [5745]: info: crm_update_peer: Node wapgw1-log: id=813454090 state=member (new) addr=r(0) ip(10.83.124.48) votes=1 born=3464 seen=3504 proc=00000000000000000000000000013312
Mar  1 11:15:59 wapgw1-2 crmd: [5749]: info: ais_status_callback: status: wapgw1-log is now member (was lost)
Mar  1 11:15:59 wapgw1-2 crmd: [5749]: info: crm_update_peer: Node wapgw1-log: id=813454090 state=member (new) addr=r(0) ip(10.83.124.48) votes=1 born=3464 seen=3504 proc=00000000000000000000000000013312
Mar  1 11:15:59 wapgw1-2 corosync[5738]: [MAIN ] Completed service synchronization, ready to provide service.

This time, the DC decided to re-probe all resources on the rejoined node.

Mar  1 11:16:00 wapgw1-2 crmd: [5749]: info: do_state_transition: State transition S_POLICY_ENGINE -> S_TRANSITION_ENGINE [ input=I_PE_SUCCESS cause=C_IPC_MESSAGE origin=handle_response ]
Mar  1 11:16:00 wapgw1-2 crmd: [5749]: info: unpack_graph: Unpacked transition 1353: 37 actions in 37 synapses
Mar  1 11:16:00 wapgw1-2 crmd: [5749]: info: do_te_invoke: Processing graph 1353 (ref=pe_calc-dc-1298967360-1592) derived from /var/lib/pengine/pe-input-1525.bz2
Mar  1 11:16:00 wapgw1-2 crmd: [5749]: info: te_rsc_command: Initiating action 30: monitor ilo-wapgw1-1:0_monitor_0 on wapgw1-log
Mar  1 11:16:00 wapgw1-2 crmd: [5749]: info: te_rsc_command: Initiating action 31: monitor ilo-wapgw1-2:0_monitor_0 on wapgw1-log
Mar  1 11:16:00 wapgw1-2 crmd: [5749]: info: te_rsc_command: Initiating action 32: monitor ilo-wapgw1-log:0_monitor_0 on wapgw1-log
Mar  1 11:16:00 wapgw1-2 crmd: [5749]: info: te_rsc_command: Initiating action 33: monitor p-drbd-mdirect1-1:0_monitor_0 on wapgw1-log
Mar  1 11:16:00 wapgw1-2 crmd: [5749]: info: te_rsc_command: Initiating action 34: monitor p-drbd-mdirect1-2:0_monitor_0 on wapgw1-log

And then... 'Suddenly' another DC appeared.

Mar  1 11:16:00 wapgw1-2 crmd: [5749]: ERROR: crmd_ha_msg_filter: Another DC detected: wapgw1-log (op=noop)
Mar  1 11:16:00 wapgw1-2 crmd: [5749]: info: do_state_transition: State transition S_TRANSITION_ENGINE -> S_ELECTION [ input=I_ELECTION cause=C_FSA_INTERNAL origin=crmd_ha_msg_filter ]
Mar  1 11:16:00 wapgw1-2 crmd: [5749]: info: update_dc: Unset DC wapgw1-2

Mar  1 11:16:00 wapgw1-log crmd: [24547]: ERROR: crmd_ha_msg_filter: Another DC detected: wapgw1-2 (op=noop)
Mar  1 11:16:00 wapgw1-log crmd: [24547]: info: do_state_transition: State transition S_FINALIZE_JOIN -> S_ELECTION [ input=I_ELECTION cause=C_FSA_INTERNAL origin=crmd_ha_msg_filter ]
Mar  1 11:16:00 wapgw1-log crmd: [24547]: info: update_dc: Unset DC wapgw1-log

All the monitor actions failed like this:

Mar  1 11:16:00 wapgw1-log crmd: [24547]: info: do_lrm_rsc_op: Performing key=33:1353:7:22dc5497-478f-49ff-b07f-9fcd6da325cd op=p-drbd-mdirect1-1:0_monitor_0 )
Mar  1 11:16:00 wapgw1-log crmd: [24547]: info: do_lrm_rsc_op: Discarding attempt to perform action monitor on p-drbd-mdirect1-1:0 in state S_ELECTION
Mar  1 11:16:00 wapgw1-log crmd: [24547]: info: send_direct_ack: ACK'ing resource op p-drbd-mdirect1-1:0_monitor_0 from 33:1353:7:22dc5497-478f-49ff-b07f-9fcd6da325cd: lrm_invoke-lrmd-1298967360-58
Mar  1 11:16:00 wapgw1-log crmd: [24547]: info: process_te_message: Processing (N)ACK lrm_invoke-lrmd-1298967360-58 from wapgw1-log
Mar  1 11:16:00 wapgw1-log crmd: [24547]: info: process_graph_event: Action p-drbd-mdirect1-1:0_monitor_0/33 (4:99;33:1353:7:22dc5497-478f-49ff-b07f-9fcd6da325cd) initiated by a different transitioner
Mar  1 11:16:00 wapgw1-log crmd: [24547]: info: abort_transition_graph: process_graph_event:456 - Triggered transition abort (complete=1, tag=lrm_rsc_op, id=p-drbd-mdirect1-1:0_monitor_0, magic=4:99;33:1353:7:22dc5497-478f-49ff-b07f-9fcd6da325cd) : Foreign event

The old DC won the election.

Mar 1 11:16:00 wapgw1-2 crmd: [5749]: info: do_dc_takeover: Taking over DC status for this partition

Mar  1 11:16:00 wapgw1-log crmd: [24547]: info: do_state_transition: State transition S_ELECTION -> S_RELEASE_DC [ input=I_RELEASE_DC cause=C_FSA_INTERNAL origin=do_election_count_vote ]
Mar  1 11:16:00 wapgw1-log crmd: [24547]: info: do_dc_release: DC role released
Mar  1 11:16:00 wapgw1-log cib: [24543]: info: cib_process_readwrite: We are now in R/O mode
Mar  1 11:16:00 wapgw1-log crmd: [24547]: info: pe_connection_destroy: Connection to the Policy Engine released
Mar  1 11:16:00 wapgw1-log crmd: [24547]: info: do_te_control: Transitioner is now inactive
Mar  1 11:16:00 wapgw1-log crmd: [24547]: info: do_te_control: Disconnecting STONITH...
Mar  1 11:16:00 wapgw1-log crmd: [24547]: info: tengine_stonith_connection_destroy: Fencing daemon disconnected
Mar  1 11:16:00 wapgw1-log crmd: [24547]: notice: Not currently connected.
Mar  1 11:16:00 wapgw1-log crmd: [24547]: info: do_state_transition: State transition S_RELEASE_DC -> S_PENDING [ input=I_RELEASE_SUCCESS cause=C_FSA_INTERNAL origin=do_dc_release ]
Mar  1 11:16:01 wapgw1-log attrd: [24545]: WARN: attrd_cib_callback: Update 2849 for probe_complete=true failed: The object/attribute does not exist
Mar  1 11:16:01 wapgw1-log attrd: [24545]: WARN: attrd_cib_callback: Update 2919 for last-failure-ilo-wapgw1-2:0=1298967173 failed: The object/attribute does not exist
Mar  1 11:16:01 wapgw1-log attrd: [24545]: WARN: attrd_cib_callback: Update 2924 for last-failure-ilo-wapgw1-1:0=1298967173 failed: The object/attribute does not exist
Mar  1 11:16:01 wapgw1-log attrd: [24545]: WARN: attrd_cib_callback: Update 2944 for fail-count-ilo-wapgw1-2:0=INFINITY failed: The object/attribute does not exist
Mar  1 11:16:01 wapgw1-log attrd: [24545]: WARN: attrd_cib_callback: Update 2947 for fail-count-ilo-wapgw1-1:0=INFINITY failed: The object/attribute does not exist
Mar  1 11:16:01 wapgw1-log crmd: [24547]: info: update_dc: Set DC to wapgw1-2 (3.0.1)

After a while, the DC remembered all those actions:

Mar  1 11:17:20 wapgw1-2 crmd: [5749]: WARN: action_timer_callback: Timer popped (timeout=20000, abort_level=1000000, complete=false)
Mar  1 11:17:20 wapgw1-2 crmd: [5749]: ERROR: print_elem: Aborting transition, action lost: [Action 33]: In-flight (id: p-drbd-mdirect1-1:0_monitor_0, loc: wapgw1-log, priority: 0)
Mar  1 11:17:20 wapgw1-2 crmd: [5749]: info: abort_transition_graph: action_timer_callback:486 - Triggered transition abort (complete=0) : Action lost
Mar  1 11:17:20 wapgw1-2 crmd: [5749]: WARN: cib_action_update: rsc_op 33: p-drbd-mdirect1-1:0_monitor_0 on wapgw1-log timed out
...
Mar  1 11:17:21 wapgw1-2 pengine: [5748]: WARN: unpack_rsc_op: Processing failed op ilo-wapgw1-log:0_monitor_0 on wapgw1-log: unknown error (1)
Mar  1 11:17:21 wapgw1-2 pengine: [5748]: WARN: unpack_rsc_op: Processing failed op p-drbd-mproxy1-2:0_monitor_0 on wapgw1-log: unknown error (1)
Mar  1 11:17:21 wapgw1-2 pengine: [5748]: WARN: unpack_rsc_op: Processing failed op p-drbd-mdirect1-1:0_monitor_0 on wapgw1-log: unknown error (1)
Mar  1 11:17:21 wapgw1-2 pengine: [5748]: WARN: unpack_rsc_op: Processing failed op p-drbd-mrouter1-1:0_monitor_0 on wapgw1-log: unknown error (1)
Mar  1 11:17:21 wapgw1-2 pengine: [5748]: WARN: unpack_rsc_op: Processing failed op vm-mdirect1-1_monitor_0 on wapgw1-log: unknown error (1)
Mar  1 11:17:21 wapgw1-2 pengine: [5748]: WARN: unpack_rsc_op: Processing failed op vm-mproxy1-1_monitor_0 on wapgw1-log: unknown error (1)
Mar  1 11:17:21 wapgw1-2 pengine: [5748]: WARN: unpack_rsc_op: Processing failed op vm-mdirect1-2_monitor_0 on wapgw1-log: unknown error (1)
...
Mar  1 11:17:21 wapgw1-2 pengine: [5748]: ERROR: native_add_running: Resource ocf::VirtualDomain:vm-mdirect1-1 appears to be active on 2 nodes.
Mar  1 11:17:21 wapgw1-2 pengine: [5748]: WARN: See http://clusterlabs.org/wiki/FAQ#Resource_is_Too_Active for more information.
Mar  1 11:17:21 wapgw1-2 pengine: [5748]: ERROR: native_add_running: Resource ocf::VirtualDomain:vm-mproxy1-1 appears to be active on 2 nodes.
Mar  1 11:17:21 wapgw1-2 pengine: [5748]: WARN: See http://clusterlabs.org/wiki/FAQ#Resource_is_Too_Active for more information.
Mar  1 11:17:21 wapgw1-2 pengine: [5748]: ERROR: native_add_running: Resource ocf::VirtualDomain:vm-mrouter1-1 appears to be active on 2 nodes.
Mar  1 11:17:21 wapgw1-2 pengine: [5748]: WARN: See http://clusterlabs.org/wiki/FAQ#Resource_is_Too_Active for more information.
...

So it tried to stop all the resources. It succeeded on the real nodes, but not on the quorum node:

Mar  1 11:17:21 wapgw1-log VirtualDomain[24851]: ERROR: Setup problem: Couldn't find utility virsh
Mar  1 11:17:21 wapgw1-log VirtualDomain[24852]: ERROR: Setup problem: Couldn't find utility virsh
Mar  1 11:17:21 wapgw1-log VirtualDomain[24853]: ERROR: Setup problem: Couldn't find utility virsh
Mar  1 11:17:21 wapgw1-log VirtualDomain[24850]: ERROR: Setup problem: Couldn't find utility virsh
Mar  1 11:17:21 wapgw1-log crmd: [24547]: info: process_lrm_event: LRM operation vm-mdirect1-2_stop_0 (call=44, rc=5, cib-update=188, confirmed=true) not installed
Mar  1 11:17:21 wapgw1-log crmd: [24547]: info: process_lrm_event: LRM operation vm-mproxy1-1_stop_0 (call=45, rc=5, cib-update=189, confirmed=true) not installed
Mar  1 11:17:21 wapgw1-log crmd: [24547]: info: process_lrm_event: LRM operation vm-mproxy1-2_stop_0 (call=46, rc=5, cib-update=190, confirmed=true) not installed
Mar  1 11:17:21 wapgw1-log crmd: [24547]: info: process_lrm_event: LRM operation vm-mdirect1-1_stop_0 (call=43, rc=5, cib-update=191, confirmed=true) not installed

At this point the cluster was stuck. The resources could not be stopped on the quorum node, because no such resources exist there. All resources on the real nodes were down. The policy engine believed the resources were started on the quorum node.

Mar  1 11:17:50 wapgw1-2 pengine: [5748]: notice: native_print: vm-mdirect1-1 (ocf::heartbeat:VirtualDomain): Started wapgw1-log FAILED
Mar  1 11:17:50 wapgw1-2 pengine: [5748]: notice: native_print: vm-mdirect1-2 (ocf::heartbeat:VirtualDomain): Started wapgw1-log
Mar  1 11:17:50 wapgw1-2 pengine: [5748]: notice: native_print: vm-mproxy1-1 (ocf::heartbeat:VirtualDomain): Started wapgw1-log
Mar  1 11:17:50 wapgw1-2 pengine: [5748]: notice: native_print: vm-mproxy1-2 (ocf::heartbeat:VirtualDomain): Started wapgw1-log

End of story. I have a full hb_report, so let me know if it would help.

I understand that the quorum node could have been fenced to resolve this situation. But fencing by reset is not a good thing, particularly for the non-clustered services running on that node. Moreover, fencing was not really needed, as these resources can never run on that node. The cluster would have continued fine if the quorum node could simply confirm that no resources were running on it.

Was there something Pacemaker could have done better? For example:

1) Retry the failed monitor actions somehow;
2) Maybe accept "not installed" on stop as a confirmation. I think that would be dangerous, though, if the node has failed in some screwed-up way;
3) Or could there be a configuration knob saying "this resource does not exist on that node", so the cluster doesn't try to manage nonexistent resources where that is impossible? (A sketch of the closest thing I can think of follows below.)
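
For option 3, the closest thing I can see is an opt-in cluster (a crm shell sketch; the constraint IDs and scores are illustrative, not from our CIB):

# opt-in cluster: resources may run only where a positive score allows them
property symmetric-cluster=false
# nothing here permits wapgw1-log, so the VM is confined to the real nodes
location loc-vm-mdirect1-1-on-1 vm-mdirect1-1 100: wapgw1-1
location loc-vm-mdirect1-1-on-2 vm-mdirect1-1 100: wapgw1-2

As far as I understand, though, the initial monitor_0 probes would still be scheduled on every online node, so this alone would not have prevented the failed probes above.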

What is the right way to avoid a cluster split when bonding fails over? I have adjusted the 'token' directive in corosync.conf to 5000; is this OK?
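
For reference, the relevant part of corosync.conf now reads like this (only the token value was changed from the default 1000 ms):

totem {
        version: 2
        # wait up to 5 s for the token before declaring a processor failed;
        # this should ride out the ~1 s active-backup failover
        token: 5000
}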

Thank you for your time.

--
Pavel Levshin //flicker

