Hi all,

I'm (still) trying to learn pacemaker, using a pair of RHEL 7 beta VMs. I've got stonith configured and it technically works (the crashed node does reboot), but pacemaker itself hangs afterwards...

Here is the config:

====
Cluster Name: rhel7-pcmk
Corosync Nodes:
 rhel7-01.alteeve.ca rhel7-02.alteeve.ca
Pacemaker Nodes:
 rhel7-01.alteeve.ca rhel7-02.alteeve.ca

Resources:

Stonith Devices:
 Resource: fence_n01_virsh (class=stonith type=fence_virsh)
  Attributes: pcmk_host_list=rhel7-01 ipaddr=lemass action=reboot login=root passwd_script=/root/lemass.pw delay=15 port=rhel7_01
  Operations: monitor interval=60s (fence_n01_virsh-monitor-interval-60s)
 Resource: fence_n02_virsh (class=stonith type=fence_virsh)
  Attributes: pcmk_host_list=rhel7-02 ipaddr=lemass action=reboot login=root passwd_script=/root/lemass.pw port=rhel7_02
  Operations: monitor interval=60s (fence_n02_virsh-monitor-interval-60s)
Fencing Levels:

Location Constraints:
Ordering Constraints:
Colocation Constraints:

Cluster Properties:
 cluster-infrastructure: corosync
 dc-version: 1.1.10-19.el7-368c726
 no-quorum-policy: ignore
 stonith-enabled: true
====
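
One thing I haven't ruled out is the host-name matching itself: the virsh domain names (rhel7_01 / rhel7_02) don't match the corosync node names, so maybe I need pcmk_host_map rather than just pcmk_host_list? This is only a sketch of what I'd try next, untested, and the exact pcs syntax may be off:

====
# Untested idea: explicitly map each cluster node name to its virsh guest name
pcs stonith update fence_n01_virsh pcmk_host_list="rhel7-01.alteeve.ca" \
    pcmk_host_map="rhel7-01.alteeve.ca:rhel7_01"
pcs stonith update fence_n02_virsh pcmk_host_list="rhel7-02.alteeve.ca" \
    pcmk_host_map="rhel7-02.alteeve.ca:rhel7_02"
====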

Here are the logs:

====
Dec 21 14:36:07 rhel7-01 corosync[1709]: [TOTEM ] A processor failed, forming new configuration.
Dec 21 14:36:09 rhel7-01 corosync[1709]: [TOTEM ] A new membership (192.168.122.101:24) was formed. Members left: 2
Dec 21 14:36:09 rhel7-01 corosync[1709]: [QUORUM] Members[1]: 1
Dec 21 14:36:09 rhel7-01 corosync[1709]: [MAIN ] Completed service synchronization, ready to provide service.
Dec 21 14:36:09 rhel7-01 crmd[1730]: notice: crm_update_peer_state: pcmk_quorum_notification: Node rhel7-02.alteeve.ca[2] - state is now lost (was member)
Dec 21 14:36:09 rhel7-01 crmd[1730]: warning: reap_dead_nodes: Our DC node (rhel7-02.alteeve.ca) left the cluster
Dec 21 14:36:09 rhel7-01 crmd[1730]: notice: do_state_transition: State transition S_NOT_DC -> S_ELECTION [ input=I_ELECTION cause=C_FSA_INTERNAL origin=reap_dead_nodes ]
Dec 21 14:36:09 rhel7-01 pacemakerd[1724]: notice: crm_update_peer_state: pcmk_quorum_notification: Node rhel7-02.alteeve.ca[2] - state is now lost (was member)
Dec 21 14:36:09 rhel7-01 crmd[1730]: notice: do_state_transition: State transition S_ELECTION -> S_INTEGRATION [ input=I_ELECTION_DC cause=C_FSA_INTERNAL origin=do_election_check ]
Dec 21 14:36:10 rhel7-01 attrd[1728]: notice: attrd_local_callback: Sending full refresh (origin=crmd)
Dec 21 14:36:10 rhel7-01 attrd[1728]: notice: attrd_trigger_update: Sending flush op to all hosts for: probe_complete (true)
Dec 21 14:36:11 rhel7-01 pengine[1729]: notice: unpack_config: On loss of CCM Quorum: Ignore
Dec 21 14:36:11 rhel7-01 pengine[1729]: warning: pe_fence_node: Node rhel7-02.alteeve.ca will be fenced because fence_n02_virsh is thought to be active there
Dec 21 14:36:11 rhel7-01 pengine[1729]: warning: custom_action: Action fence_n02_virsh_stop_0 on rhel7-02.alteeve.ca is unrunnable (offline)
Dec 21 14:36:11 rhel7-01 pengine[1729]: warning: stage6: Scheduling Node rhel7-02.alteeve.ca for STONITH
Dec 21 14:36:11 rhel7-01 pengine[1729]: notice: LogActions: Move fence_n02_virsh (Started rhel7-02.alteeve.ca -> rhel7-01.alteeve.ca)
Dec 21 14:36:11 rhel7-01 pengine[1729]: warning: process_pe_message: Calculated Transition 0: /var/lib/pacemaker/pengine/pe-warn-2.bz2
Dec 21 14:36:11 rhel7-01 crmd[1730]: notice: te_fence_node: Executing reboot fencing operation (11) on rhel7-02.alteeve.ca (timeout=60000)
Dec 21 14:36:11 rhel7-01 stonith-ng[1726]: notice: handle_request: Client crmd.1730.4f6ea9e1 wants to fence (reboot) 'rhel7-02.alteeve.ca' with device '(any)'
Dec 21 14:36:11 rhel7-01 stonith-ng[1726]: notice: initiate_remote_stonith_op: Initiating remote operation reboot for rhel7-02.alteeve.ca: ea720bbf-aeab-43bb-a196-3a4c091dea75 (0)
Dec 21 14:36:11 rhel7-01 stonith-ng[1726]: notice: can_fence_host_with_device: fence_n01_virsh can not fence rhel7-02.alteeve.ca: static-list
Dec 21 14:36:11 rhel7-01 stonith-ng[1726]: notice: can_fence_host_with_device: fence_n02_virsh can not fence rhel7-02.alteeve.ca: static-list
Dec 21 14:36:11 rhel7-01 stonith-ng[1726]: error: remote_op_done: Operation reboot of rhel7-02.alteeve.ca by rhel7-01.alteeve.ca for [email protected]: No such device
Dec 21 14:36:11 rhel7-01 crmd[1730]: notice: tengine_stonith_callback: Stonith operation 2/11:0:0:52e1fdf2-0b3a-42be-b7df-4d9dadb8d98b: No such device (-19)
Dec 21 14:36:11 rhel7-01 crmd[1730]: notice: tengine_stonith_callback: Stonith operation 2 for rhel7-02.alteeve.ca failed (No such device): aborting transition.
Dec 21 14:36:11 rhel7-01 crmd[1730]: notice: tengine_stonith_notify: Peer rhel7-02.alteeve.ca was not terminated (reboot) by rhel7-01.alteeve.ca for rhel7-01.alteeve.ca: No such device (ref=ea720bbf-aeab-43bb-a196-3a4c091dea75) by client crmd.1730
Dec 21 14:36:11 rhel7-01 crmd[1730]: notice: run_graph: Transition 0 (Complete=1, Pending=0, Fired=0, Skipped=5, Incomplete=0, Source=/var/lib/pacemaker/pengine/pe-warn-2.bz2): Stopped
Dec 21 14:36:11 rhel7-01 crmd[1730]: notice: too_many_st_failures: No devices found in cluster to fence rhel7-02.alteeve.ca, giving up
Dec 21 14:36:11 rhel7-01 crmd[1730]: notice: do_state_transition: State transition S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS cause=C_FSA_INTERNAL origin=notify_crmd ]
====
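
The "can_fence_host_with_device ... static-list" lines look like the heart of it: stonith-ng apparently doesn't believe either device can fence rhel7-02.alteeve.ca. I assume a query like the following would show what stonith-ng actually has registered for that target, though I haven't captured that output yet:

====
# Ask stonithd which devices it thinks can fence the lost node
stonith_admin --list rhel7-02.alteeve.ca

# And dump the registered stonith devices with their attributes
pcs stonith show --full
====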

I've tried both the full host names and the short host names in 'pcmk_host_list=', but got the same result both times.
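
Next I suppose I should test fencing outside of a crash, both by calling the agent directly and by asking the cluster to fence the node, to separate "agent problem" from "host matching problem". Roughly this, untested and with the option names from memory:

====
# Call the fence agent directly, bypassing pacemaker
fence_virsh --ip=lemass --username=root --password-script=/root/lemass.pw \
    --plug=rhel7_02 --action=status

# Then ask the cluster itself to fence the node
pcs stonith fence rhel7-02.alteeve.ca
====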

Versions:
====
pacemaker-1.1.10-19.el7.x86_64
pcs-0.9.99-2.el7.x86_64
====

Can someone hit me with a clustick?

--
Digimer
Papers and Projects: https://alteeve.ca/w/
What if the cure for cancer is trapped in the mind of a person without access to education?

_______________________________________________
Pacemaker mailing list: [email protected]
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org
