On 05.06.2013 at 00:52, Andrew Beekhof <and...@beekhof.net> wrote:

>> been restored the resources aren't restarted. Running crm_resource -P
>> brings everything up, but of course it would be nice if this happened
>> automatically. Is there any way to achieve this?
> 
> It should happen automatically.
> Logs?

Hi Andrew,

Thanks for your reply.
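
For reference, the manual fix I mentioned above is a cluster-wide re-probe; 
as far as I know the short and long forms are equivalent:

        crm_resource --reprobe   # re-check the state of all resources on all nodes
        crm_resource -P          # short form of --reprobe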

Here are the logs:

Jun  5 01:11:03 test4 pengine: [18625]: WARN: cluster_status: We do not have 
quorum - fencing and resource management disabled
Jun  5 01:11:03 test4 pengine: [18625]: notice: LogActions: Start   
pingtest:0#011(test4 - blocked)
Jun  5 01:11:03 test4 pengine: [18625]: notice: LogActions: Start   
drbd:0#011(test4 - blocked)
Jun  5 01:11:03 test4 crmd: [18626]: notice: do_state_transition: State 
transition S_POLICY_ENGINE -> S_TRANSITION_ENGINE [ input=I_PE_SUCCESS 
cause=C_IPC_MESSAGE origin=handle_response ]
Jun  5 01:11:03 test4 crmd: [18626]: info: do_te_invoke: Processing graph 9 
(ref=pe_calc-dc-1370387463-81) derived from /var/lib/pengine/pe-input-108.bz2
Jun  5 01:11:03 test4 crmd: [18626]: notice: run_graph: ==== Transition 9 
(Complete=0, Pending=0, Fired=0, Skipped=0, Incomplete=0, 
Source=/var/lib/pengine/pe-input-108.bz2): Complete
Jun  5 01:11:03 test4 crmd: [18626]: notice: do_state_transition: State 
transition S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS 
cause=C_FSA_INTERNAL origin=notify_crmd ]
Jun  5 01:11:03 test4 pengine: [18625]: notice: process_pe_message: Transition 
9: PEngine Input stored in: /var/lib/pengine/pe-input-108.bz2
Jun  5 01:11:06 test4 attrd: [18624]: notice: attrd_trigger_update: Sending 
flush op to all hosts for: pingd (<null>)
Jun  5 01:11:06 test4 attrd: [18624]: notice: attrd_perform_update: Sent delete 
33: node=test4, attr=pingd, id=<n/a>, set=(null), section=status
Jun  5 01:11:06 test4 crmd: [18626]: info: abort_transition_graph: 
te_update_diff:194 - Triggered transition abort (complete=1, 
tag=transient_attributes, id=test4, magic=NA, cib=0.43.98) : Transient 
attribute: removal
Jun  5 01:11:06 test4 crmd: [18626]: notice: do_state_transition: State 
transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_FSA_INTERNAL 
origin=abort_transition_graph ]
Jun  5 01:11:06 test4 pengine: [18625]: WARN: cluster_status: We do not have 
quorum - fencing and resource management disabled
Jun  5 01:11:06 test4 pengine: [18625]: notice: LogActions: Start   
pingtest:0#011(test4 - blocked)
Jun  5 01:11:06 test4 pengine: [18625]: notice: LogActions: Start   
drbd:0#011(test4 - blocked)
Jun  5 01:11:06 test4 crmd: [18626]: notice: do_state_transition: State 
transition S_POLICY_ENGINE -> S_TRANSITION_ENGINE [ input=I_PE_SUCCESS 
cause=C_IPC_MESSAGE origin=handle_response ]
Jun  5 01:11:06 test4 crmd: [18626]: info: do_te_invoke: Processing graph 10 
(ref=pe_calc-dc-1370387466-82) derived from /var/lib/pengine/pe-input-109.bz2
Jun  5 01:11:06 test4 crmd: [18626]: notice: run_graph: ==== Transition 10 
(Complete=0, Pending=0, Fired=0, Skipped=0, Incomplete=0, 
Source=/var/lib/pengine/pe-input-109.bz2): Complete
Jun  5 01:11:06 test4 crmd: [18626]: notice: do_state_transition: State 
transition S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS 
cause=C_FSA_INTERNAL origin=notify_crmd ]
Jun  5 01:11:06 test4 pengine: [18625]: notice: process_pe_message: Transition 
10: PEngine Input stored in: /var/lib/pengine/pe-input-109.bz2
Jun  5 01:11:11 test4 corosync[18598]:   [pcmk  ] notice: pcmk_peer_update: 
Transitional membership event on ring 128: memb=1, new=0, lost=0
Jun  5 01:11:11 test4 corosync[18598]:   [pcmk  ] info: pcmk_peer_update: memb: 
test4 1835313344
Jun  5 01:11:11 test4 corosync[18598]:   [pcmk  ] notice: pcmk_peer_update: 
Stable membership event on ring 128: memb=2, new=1, lost=0
Jun  5 01:11:11 test4 corosync[18598]:   [pcmk  ] info: update_member: Node 
526690496/backup3 is now: member
Jun  5 01:11:11 test4 corosync[18598]:   [pcmk  ] info: pcmk_peer_update: NEW:  
backup3 526690496
Jun  5 01:11:11 test4 corosync[18598]:   [pcmk  ] info: pcmk_peer_update: MEMB: 
backup3 526690496
Jun  5 01:11:11 test4 corosync[18598]:   [pcmk  ] info: pcmk_peer_update: MEMB: 
test4 1835313344
Jun  5 01:11:11 test4 corosync[18598]:   [pcmk  ] info: 
send_member_notification: Sending membership update 128 to 2 children
Jun  5 01:11:11 test4 corosync[18598]:   [TOTEM ] A processor joined or left 
the membership and a new membership was formed.
Jun  5 01:11:11 test4 crmd: [18626]: notice: ais_dispatch_message: Membership 
128: quorum acquired
Jun  5 01:11:11 test4 crmd: [18626]: info: ais_status_callback: status: backup3 
is now member (was lost)
Jun  5 01:11:11 test4 crmd: [18626]: WARN: match_down_event: No match for 
shutdown action on backup3
Jun  5 01:11:11 test4 cib: [18621]: notice: ais_dispatch_message: Membership 
128: quorum acquired
Jun  5 01:11:11 test4 crmd: [18626]: info: erase_status_tag: Deleting xpath: 
//node_state[@uname='backup3']/lrm
Jun  5 01:11:11 test4 cib: [18621]: info: crm_update_peer: Node backup3: 
id=526690496 state=member (new) addr=r(0) ip(192.168.100.31)  votes=1 born=120 
seen=128 proc=00000000000000000000000000111312
Jun  5 01:11:11 test4 crmd: [18626]: info: erase_status_tag: Deleting xpath: 
//node_state[@uname='backup3']/transient_attributes
Jun  5 01:11:11 test4 cib: [18621]: info: cib_process_request: Operation 
complete: op cib_delete for section //node_state[@uname='backup3']/lrm 
(origin=local/crmd/113, version=0.43.99): ok (rc=0)
Jun  5 01:11:11 test4 crmd: [18626]: info: crm_update_peer: Node backup3: 
id=526690496 state=member (new) addr=r(0) ip(192.168.100.31)  votes=1 born=120 
seen=128 proc=00000000000000000000000000111312
Jun  5 01:11:11 test4 cib: [18621]: info: cib_process_request: Operation 
complete: op cib_delete for section 
//node_state[@uname='backup3']/transient_attributes (origin=local/crmd/114, 
version=0.43.100): ok (rc=0)
Jun  5 01:11:11 test4 cib: [18621]: info: cib_process_request: Operation 
complete: op cib_modify for section nodes (origin=local/crmd/115, 
version=0.43.101): ok (rc=0)
Jun  5 01:11:11 test4 cib: [18621]: info: cib_process_request: Operation 
complete: op cib_modify for section cib (origin=local/crmd/117, 
version=0.43.103): ok (rc=0)
Jun  5 01:11:11 test4 crmd: [18626]: info: crmd_ais_dispatch: Setting expected 
votes to 3
Jun  5 01:11:11 test4 cib: [18621]: info: cib_process_request: Operation 
complete: op cib_modify for section crm_config (origin=local/crmd/119, 
version=0.43.104): ok (rc=0)
Jun  5 01:11:11 test4 crmd: [18626]: info: abort_transition_graph: 
te_update_diff:320 - Triggered transition abort (complete=1, tag=lrm_rsc_op, 
id=drbd:0_last_failure_0, magic=0:5;6:2:7:ca6fd6dd-bf3c-4b75-9214-26e3a36b2f1c, 
cib=0.43.99) : Resource op removal
Jun  5 01:11:11 test4 crmd: [18626]: info: abort_transition_graph: 
te_update_diff:194 - Triggered transition abort (complete=1, 
tag=transient_attributes, id=backup3, magic=NA, cib=0.43.100) : Transient 
attribute: removal
Jun  5 01:11:11 test4 corosync[18598]:   [CPG   ] chosen downlist: sender r(0) 
ip(192.168.100.109) ; members(old:1 left:0)
Jun  5 01:11:11 test4 crmd: [18626]: notice: do_state_transition: State 
transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_FSA_INTERNAL 
origin=abort_transition_graph ]
Jun  5 01:11:11 test4 crmd: [18626]: info: do_state_transition: Membership 
changed: 120 -> 128 - join restart
Jun  5 01:11:11 test4 crmd: [18626]: notice: do_state_transition: State 
transition S_POLICY_ENGINE -> S_INTEGRATION [ input=I_NODE_JOIN 
cause=C_FSA_INTERNAL origin=do_state_transition ]
Jun  5 01:11:11 test4 corosync[18598]:   [MAIN  ] Completed service 
synchronization, ready to provide service.
Jun  5 01:11:11 test4 crmd: [18626]: info: join_make_offer: Making join offers 
based on membership 128
Jun  5 01:11:11 test4 crmd: [18626]: info: do_dc_join_offer_all: join-4: 
Waiting on 2 outstanding join acks
Jun  5 01:11:11 test4 crmd: [18626]: info: update_dc: Set DC to test4 (3.0.6)

Please note that at the moment only two of the three nodes are online, but 
quorum is established, as expected. Both nodes are running Corosync and 
Pacemaker, but the second node doesn't have any of the configured resources 
installed (so I got "not installed" errors there; usually Pacemaker is 
disabled on that node). The resources also aren't started if Pacemaker is 
disabled on that node and only Corosync is running.
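
If it helps, this is how I verified quorum and membership on test4 (I 
believe these are the right flags):

        crm_mon -1 -A                     # one-shot status; -A also shows node attributes like pingd
        cibadmin -Q | grep have-quorum    # the cib element should report have-quorum="1"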

analysis.txt from hb_report states:

Log patterns:
Jun  5 01:14:11 test4 crmd: [18626]: ERROR: crm_timer_popped: Integration Timer 
(I_INTEGRATED) just popped in state S_INTEGRATION! (180000ms)
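
(180000ms looks like the default crmd-integration-timeout of 3 minutes, if I 
read the docs correctly.) If it's useful I can replay the stored PE inputs; 
I believe crm_simulate can read the bz2 files directly:

        crm_simulate -S -x /var/lib/pengine/pe-input-109.bz2   # re-run the transition from the log above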

My config:

node backup3 \
        attributes standby="off"
node test3
node test4
primitive apache lsb:apache2 \
        op monitor interval="10" timeout="20" \
        meta target-role="Started"
primitive drbd ocf:linbit:drbd \
        params drbd_resource="www_r0" \
        op monitor interval="10"
primitive fs_drbd ocf:heartbeat:Filesystem \
        params device="/dev/drbd0" directory="/var/www" fstype="ext4" \
        op monitor interval="5" \
        meta target-role="Started"
primitive pingtest ocf:pacemaker:ping \
        params multiplier="1000" host_list="192.168.100.19" \
        op monitor interval="5"
primitive sip ocf:heartbeat:IPaddr2 \
        params ip="192.168.100.30" nic="eth0" \
        op monitor interval="10" timeout="20" \
        meta target-role="Started"
group grp_all sip fs_drbd apache
ms ms_drbd drbd \
        meta master-max="1" master-node-max="1" clone-max="2" clone-node-max="1" notify="true"
clone clone_pingtest pingtest
location loc_all_on_best_ping grp_all \
        rule $id="loc_all_on_best_ping-rule" -inf: not_defined pingd or pingd lt 1000
colocation coloc_all_on_drbd inf: grp_all ms_drbd:Master
order order_all_after_drbd inf: ms_drbd:promote grp_all:start
property $id="cib-bootstrap-options" \
        dc-version="1.1.7-ee0730e13d124c3d58f00016c3376a1de5323cff" \
        cluster-infrastructure="openais" \
        expected-quorum-votes="3" \
        no-quorum-policy="stop" \
        stonith-enabled="false" \
        last-lrm-refresh="1370360692" \
        default-resource-stickiness="100" \
        maintenance-mode="false"
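
One thing I notice while re-reading: loc_all_on_best_ping blocks grp_all 
whenever pingd is not defined, and the log above shows attrd deleting the 
pingd attribute. If I have the syntax right, the transient attribute can be 
queried per node like this:

        crm_attribute -N test4 -n pingd -l reboot -G   # query pingd from the status (transient) section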

Best regards,
Denis Witt