[Pacemaker] catch-22: can't fence node A because node A has the fencing resource

Brian J. Murrell Mon, 02 Dec 2013 12:57:56 -0800

So, I'm migrating my working pacemaker configuration from 1.1.7 to
1.1.10 and am finding what appears to be a new behavior in 1.1.10.


If a given node is running a fencing resource and that node goes AWOL,
it needs to be fenced (of course).  But any other node trying to take
over the fencing resource to fence it appears to first want the current
owner of the fencing resource to fence the node.  Of course that can't
happen since the node that needs to do the fencing is AWOL.

So while I can buy into the general policy that a node needs to be
fenced in order to take over its resources, fencing resources have to
be exempt from this or there can be this catch-22.
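
To make the suspected circular dependency concrete, here is a toy model in
Python. It is only a sketch of the behavior I think I am seeing, not
Pacemaker's actual scheduling logic; the function name and the
device_must_stop_first flag are made up for illustration.

```python
def can_fence(target, device_owner, online_nodes, device_must_stop_first):
    """Toy model: can any surviving node fence `target`?

    device_owner           -- node where the fencing resource was last started
    online_nodes           -- set of nodes still in the cluster
    device_must_stop_first -- hypothetical flag: the fencing resource is
                              treated like any other resource, so it must be
                              stopped on (or its owner fenced) before moving
    """
    survivors = set(online_nodes) - {target}
    if not survivors:
        return False  # nobody left to do the fencing
    if device_owner in survivors:
        return True   # the device is already active on a live node
    # The device is recorded as running on a node that is gone. If it has
    # to be recovered like a normal resource, its stop cannot run and its
    # owner cannot be fenced without it: the catch-22.
    return not device_must_stop_first


# host1 owned st-fencing and then disappeared; host2 is the only survivor.
print(can_fence("host1", "host1", {"host2"}, device_must_stop_first=True))
print(can_fence("host1", "host1", {"host2"}, device_must_stop_first=False))
```

The first call models what I appear to be getting on 1.1.10, the second what
I believe 1.1.7 did.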

I believe that is how things were working in 1.1.7 but now that I'm on
1.1.10[-1.el6_4.4] this no longer seems to be the case.

Or perhaps there is some additional configuration that 1.1.10 needs to
effect this behavior.  Here is my configuration:

Cluster Name: 
Corosync Nodes:
 
Pacemaker Nodes:
 host1 host2 

Resources: 
 Resource: rsc1 (class=ocf provider=foo type=Target)
  Attributes: target=111bad0a-a86a-40e3-b056-c5c93168aa0d 
  Meta Attrs: target-role=Started 
  Operations: monitor interval=5 timeout=60 (rsc1-monitor-5)
              start interval=0 timeout=300 (rsc1-start-0)
              stop interval=0 timeout=300 (rsc1-stop-0)
 Resource: rsc2 (class=ocf provider=chroma type=Target)
  Attributes: target=a8efa349-4c73-4efc-90d3-d6be7d73c515 
  Meta Attrs: target-role=Started 
  Operations: monitor interval=5 timeout=60 (rsc2-monitor-5)
              start interval=0 timeout=300 (rsc2-start-0)
              stop interval=0 timeout=300 (rsc2-stop-0)

Stonith Devices: 
 Resource: st-fencing (class=stonith type=fence_foo)
Fencing Levels: 

Location Constraints:
  Resource: rsc1
    Enabled on: host1 (score:20) (id:rsc1-primary)
    Enabled on: host2 (score:10) (id:rsc1-secondary)
  Resource: rsc2
    Enabled on: host2 (score:20) (id:rsc2-primary)
    Enabled on: host1 (score:10) (id:rsc2-secondary)
Ordering Constraints:
Colocation Constraints:

Cluster Properties:
 cluster-infrastructure: classic openais (with plugin)
 dc-version: 1.1.10-1.el6_4.4-368c726
 expected-quorum-votes: 2
 no-quorum-policy: ignore
 stonith-enabled: true
 symmetric-cluster: true

One thing that PCS is not showing that might be relevant here is that I
have a resource stickiness value set to 1000 to prevent resources from
failing back to nodes after a failover.
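
As a rough illustration of why a stickiness of 1000 prevents failback with
the location scores above, here is a small Python sketch. It assumes the
stickiness is simply added to the score of the node a resource is currently
running on; preferred_node is a made-up helper, not anything from Pacemaker.

```python
# Location scores copied from the configuration above.
LOCATION_SCORES = {
    "rsc1": {"host1": 20, "host2": 10},
    "rsc2": {"host2": 20, "host1": 10},
}
STICKINESS = 1000  # the resource stickiness value mentioned above


def preferred_node(resource, current_node, online_nodes):
    """Pick the node with the highest score for `resource`.

    Assumption: the stickiness is added to the score of the node the
    resource is currently running on (None if it is not running).
    """
    def score(node):
        base = LOCATION_SCORES[resource].get(node, 0)
        return base + (STICKINESS if node == current_node else 0)

    return max(sorted(online_nodes), key=score)


# rsc1 failed over to host2 and host1 has since come back:
# host2 scores 10 + 1000, host1 scores 20, so rsc1 stays put.
print(preferred_node("rsc1", "host2", {"host1", "host2"}))

# rsc1 is not running anywhere: the plain location scores decide.
print(preferred_node("rsc1", None, {"host1", "host2"}))
```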

With the above configuration, if host1 is shut down, host2 just spins in
a loop doing:

Dec  2 20:00:02 host2 pengine[8923]:  warning: pe_fence_node: Node host1 will 
be fenced because the node is no longer part of the cluster
Dec  2 20:00:02 host2 pengine[8923]:  warning: determine_online_status: Node 
host1 is unclean
Dec  2 20:00:02 host2 pengine[8923]:  warning: custom_action: Action 
st-fencing_stop_0 on host1 is unrunnable (offline)
Dec  2 20:00:02 host2 pengine[8923]:  warning: custom_action: Action 
rsc1_stop_0 on host1 is unrunnable (offline)
Dec  2 20:00:02 host2 pengine[8923]:  warning: stage6: Scheduling Node host1 
for STONITH
Dec  2 20:00:02 host2 pengine[8923]:   notice: LogActions: Move    
st-fencing#011(Started host1 -> host2)
Dec  2 20:00:02 host2 pengine[8923]:   notice: LogActions: Move    
rsc1#011(Started host1 -> host2)
Dec  2 20:00:02 host2 crmd[8924]:   notice: te_fence_node: Executing reboot 
fencing operation (13) on host1 (timeout=60000)
Dec  2 20:00:02 host2 stonith-ng[8920]:   notice: handle_request: Client 
crmd.8924.39504cd3 wants to fence (reboot) 'host1' with device '(any)'
Dec  2 20:00:02 host2 stonith-ng[8920]:   notice: initiate_remote_stonith_op: 
Initiating remote operation reboot for host1: 
ad69ead5-0bbb-45d8-bb07-30bcd405ace2 (0)
Dec  2 20:00:02 host2 pengine[8923]:  warning: process_pe_message: Calculated 
Transition 22: /var/lib/pacemaker/pengine/pe-warn-2.bz2  
Dec  2 20:01:14 host2 stonith-ng[8920]:    error: remote_op_done: Operation 
reboot of host1 by host2 for crmd.8924@host2.ad69ead5: Timer expired
Dec  2 20:01:14 host2 crmd[8924]:   notice: tengine_stonith_callback: Stonith 
operation 4/13:22:0:0171e376-182e-485f-a484-9e638e1bd355: Timer expired (-62)
Dec  2 20:01:14 host2 crmd[8924]:   notice: tengine_stonith_callback: Stonith 
operation 4 for host1 failed (Timer expired): aborting transition.
Dec  2 20:01:14 host2 crmd[8924]:   notice: tengine_stonith_notify: Peer host1 
was not terminated (reboot) by host2 for host2: Timer expired 
(ref=ad69ead5-0bbb-45d8-bb07-30bcd405ace2) by client crmd.8924
Dec  2 20:01:14 host2 crmd[8924]:   notice: run_graph: Transition 22 
(Complete=1, Pending=0, Fired=0, Skipped=7, Incomplete=0, 
Source=/var/lib/pacemaker/pengine/pe-warn-2.bz2): Stopped
Dec  2 20:01:14 host2 pengine[8923]:   notice: unpack_config: On loss of CCM 
Quorum: Ignore
Dec  2 20:01:14 host2 pengine[8923]:  warning: pe_fence_node: Node host1 will 
be fenced because the node is no longer part of the cluster  
Dec  2 20:01:14 host2 pengine[8923]:  warning: determine_online_status: Node 
host1 is unclean
Dec  2 20:01:14 host2 pengine[8923]:  warning: custom_action: Action 
st-fencing_stop_0 on host1 is unrunnable (offline)
Dec  2 20:01:14 host2 pengine[8923]:  warning: custom_action: Action 
rsc1_stop_0 on host1 is unrunnable (offline)  
Dec  2 20:01:14 host2 pengine[8923]:  warning: stage6: Scheduling Node host1 
for STONITH
Dec  2 20:01:14 host2 pengine[8923]:   notice: LogActions: Move    
st-fencing#011(Started host1 -> host2)
Dec  2 20:01:14 host2 pengine[8923]:   notice: LogActions: Move    
rsc1#011(Started host1 -> host2)
Dec  2 20:01:14 host2 pengine[8923]:  warning: process_pe_message: Calculated 
Transition 23: /var/lib/pacemaker/pengine/pe-warn-2.bz2
Dec  2 20:01:14 host2 crmd[8924]:   notice: te_fence_node: Executing reboot 
fencing operation (13) on host1 (timeout=60000)  
Dec  2 20:01:14 host2 stonith-ng[8920]:   notice: handle_request: Client 
crmd.8924.39504cd3 wants to fence (reboot) 'host1' with device '(any)'
Dec  2 20:01:14 host2 stonith-ng[8920]:   notice: initiate_remote_stonith_op: 
Initiating remote operation reboot for host1: 
4c3f947b-12a7-4b6f-84a9-c5ddcbeb55c6 (0)
Dec  2 20:02:26 host2 stonith-ng[8920]:    error: remote_op_done: Operation 
reboot of host1 by host2 for crmd.8924@host2.4c3f947b: Timer expired
Dec  2 20:02:26 host2 crmd[8924]:   notice: tengine_stonith_callback: Stonith 
operation 5/13:23:0:0171e376-182e-485f-a484-9e638e1bd355: Timer expired (-62)
Dec  2 20:02:26 host2 crmd[8924]:   notice: tengine_stonith_callback: Stonith 
operation 5 for host1 failed (Timer expired): aborting transition.  
Dec  2 20:02:26 host2 crmd[8924]:   notice: tengine_stonith_notify: Peer host1 
was not terminated (reboot) by host2 for host2: Timer expired 
(ref=4c3f947b-12a7-4b6f-84a9-c5ddcbeb55c6) by client crmd.8924  
Dec  2 20:02:26 host2 crmd[8924]:   notice: run_graph: Transition 23 
(Complete=1, Pending=0, Fired=0, Skipped=7, Incomplete=0, 
Source=/var/lib/pacemaker/pengine/pe-warn-2.bz2): Stopped
Dec  2 20:02:26 host2 pengine[8923]:   notice: unpack_config: On loss of CCM 
Quorum: Ignore
Dec  2 20:02:26 host2 pengine[8923]:  warning: pe_fence_node: Node host1 will 
be fenced because the node is no longer part of the cluster
Dec  2 20:02:26 host2 pengine[8923]:  warning: determine_online_status: Node 
host1 is unclean
Dec  2 20:02:26 host2 pengine[8923]:  warning: custom_action: Action 
st-fencing_stop_0 on host1 is unrunnable (offline)
Dec  2 20:02:26 host2 pengine[8923]:  warning: custom_action: Action 
rsc1_stop_0 on host1 is unrunnable (offline)
Dec  2 20:02:26 host2 pengine[8923]:  warning: stage6: Scheduling Node host1 
for STONITH
Dec  2 20:02:26 host2 pengine[8923]:   notice: LogActions: Move    
st-fencing#011(Started host1 -> host2)
Dec  2 20:02:26 host2 pengine[8923]:   notice: LogActions: Move    
rsc1#011(Started host1 -> host2)
Dec  2 20:02:26 host2 crmd[8924]:   notice: te_fence_node: Executing reboot 
fencing operation (13) on host1 (timeout=60000)
Dec  2 20:02:26 host2 stonith-ng[8920]:   notice: handle_request: Client 
crmd.8924.39504cd3 wants to fence (reboot) 'host1' with device '(any)'
Dec  2 20:02:26 host2 stonith-ng[8920]:   notice: initiate_remote_stonith_op: 
Initiating remote operation reboot for host1: 
4b9c1ffc-3029-4b6a-8128-63c05f0ef8de (0)
Dec  2 20:02:26 host2 pengine[8923]:  warning: process_pe_message: Calculated 
Transition 24: /var/lib/pacemaker/pengine/pe-warn-2.bz2
Dec  2 20:03:38 host2 stonith-ng[8920]:    error: remote_op_done: Operation 
reboot of host1 by host2 for crmd.8924@host2.4b9c1ffc: Timer expired
Dec  2 20:03:38 host2 crmd[8924]:   notice: tengine_stonith_callback: Stonith 
operation 6/13:24:0:0171e376-182e-485f-a484-9e638e1bd355: Timer expired (-62)  
Dec  2 20:03:38 host2 crmd[8924]:   notice: tengine_stonith_callback: Stonith 
operation 6 for host1 failed (Timer expired): aborting transition.
Dec  2 20:03:38 host2 crmd[8924]:   notice: tengine_stonith_notify: Peer host1 
was not terminated (reboot) by host2 for host2: Timer expired 
(ref=4b9c1ffc-3029-4b6a-8128-63c05f0ef8de) by client crmd.8924
Dec  2 20:03:38 host2 crmd[8924]:   notice: run_graph: Transition 24 
(Complete=1, Pending=0, Fired=0, Skipped=7, Incomplete=0, 
Source=/var/lib/pacemaker/pengine/pe-warn-2.bz2): Stopped
Dec  2 20:03:38 host2 pengine[8923]:   notice: unpack_config: On loss of CCM 
Quorum: Ignore
Dec  2 20:03:38 host2 pengine[8923]:  warning: pe_fence_node: Node host1 will 
be fenced because the node is no longer part of the cluster
Dec  2 20:03:38 host2 pengine[8923]:  warning: determine_online_status: Node 
host1 is unclean
Dec  2 20:03:38 host2 pengine[8923]:  warning: custom_action: Action 
st-fencing_stop_0 on host1 is unrunnable (offline)
Dec  2 20:03:38 host2 pengine[8923]:  warning: custom_action: Action 
rsc1_stop_0 on host1 is unrunnable (offline)
Dec  2 20:03:38 host2 pengine[8923]:  warning: stage6: Scheduling Node host1 
for STONITH
Dec  2 20:03:38 host2 pengine[8923]:   notice: LogActions: Move    
st-fencing#011(Started host1 -> host2)
Dec  2 20:03:38 host2 pengine[8923]:   notice: LogActions: Move    
rsc1#011(Started host1 -> host2)
Dec  2 20:03:38 host2 crmd[8924]:   notice: te_fence_node: Executing reboot 
fencing operation (13) on host1 (timeout=60000)  
Dec  2 20:03:38 host2 stonith-ng[8920]:   notice: handle_request: Client 
crmd.8924.39504cd3 wants to fence (reboot) 'host1' with device '(any)'
Dec  2 20:03:38 host2 stonith-ng[8920]:   notice: initiate_remote_stonith_op: 
Initiating remote operation reboot for host1: 
8200c15c-d138-4b0a-b6df-ac6fe6e46ef1 (0)
Dec  2 20:03:38 host2 pengine[8923]:  warning: process_pe_message: Calculated 
Transition 25: /var/lib/pacemaker/pengine/pe-warn-2.bz2
Dec  2 20:04:50 host2 stonith-ng[8920]:    error: remote_op_done: Operation 
reboot of host1 by host2 for crmd.8924@host2.8200c15c: Timer expired
Dec  2 20:04:50 host2 crmd[8924]:   notice: tengine_stonith_callback: Stonith 
operation 7/13:25:0:0171e376-182e-485f-a484-9e638e1bd355: Timer expired (-62)
Dec  2 20:04:50 host2 crmd[8924]:   notice: tengine_stonith_callback: Stonith 
operation 7 for host1 failed (Timer expired): aborting transition.
Dec  2 20:04:50 host2 crmd[8924]:   notice: tengine_stonith_notify: Peer host1 
was not terminated (reboot) by host2 for host2: Timer expired 
(ref=8200c15c-d138-4b0a-b6df-ac6fe6e46ef1) by client crmd.8924
Dec  2 20:04:50 host2 crmd[8924]:   notice: run_graph: Transition 25 
(Complete=1, Pending=0, Fired=0, Skipped=7, Incomplete=0, 
Source=/var/lib/pacemaker/pengine/pe-warn-2.bz2): Stopped
Dec  2 20:04:50 host2 pengine[8923]:   notice: unpack_config: On loss of CCM 
Quorum: Ignore
Dec  2 20:04:50 host2 pengine[8923]:  warning: pe_fence_node: Node host1 will 
be fenced because the node is no longer part of the cluster
Dec  2 20:04:50 host2 pengine[8923]:  warning: determine_online_status: Node 
host1 is unclean
Dec  2 20:04:50 host2 pengine[8923]:  warning: custom_action: Action 
st-fencing_stop_0 on host1 is unrunnable (offline)
Dec  2 20:04:50 host2 pengine[8923]:  warning: custom_action: Action 
rsc1_stop_0 on host1 is unrunnable (offline)
Dec  2 20:04:50 host2 pengine[8923]:  warning: stage6: Scheduling Node host1 
for STONITH
Dec  2 20:04:50 host2 pengine[8923]:   notice: LogActions: Move    
st-fencing#011(Started host1 -> host2)
Dec  2 20:04:50 host2 pengine[8923]:   notice: LogActions: Move    
rsc1#011(Started host1 -> host2)
Dec  2 20:04:50 host2 pengine[8923]:  warning: process_pe_message: Calculated 
Transition 26: /var/lib/pacemaker/pengine/pe-warn-2.bz2  
Dec  2 20:04:50 host2 crmd[8924]:   notice: te_fence_node: Executing reboot 
fencing operation (13) on host1 (timeout=60000)
Dec  2 20:04:50 host2 stonith-ng[8920]:   notice: handle_request: Client 
crmd.8924.39504cd3 wants to fence (reboot) 'host1' with device '(any)'
Dec  2 20:04:50 host2 stonith-ng[8920]:   notice: initiate_remote_stonith_op: 
Initiating remote operation reboot for host1: 
8ceabae8-6876-4d6d-b44c-c64c0863f68c (0)
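
To summarize the loop above without reading every line, the stonith-ng
messages can be paired up by operation id. The following Python sketch does
that for two of the attempts; the sample lines are copied from the log above
with the mail wrapping undone, the year 2013 is taken from the date of this
message (syslog timestamps carry none), and the regular expressions are only
fitted to the lines shown here.

```python
import re
from datetime import datetime

# Two attempts copied from the log above, rejoined into single lines.
SAMPLE = """\
Dec  2 20:00:02 host2 stonith-ng[8920]:   notice: initiate_remote_stonith_op: Initiating remote operation reboot for host1: ad69ead5-0bbb-45d8-bb07-30bcd405ace2 (0)
Dec  2 20:01:14 host2 stonith-ng[8920]:    error: remote_op_done: Operation reboot of host1 by host2 for crmd.8924@host2.ad69ead5: Timer expired
Dec  2 20:01:14 host2 stonith-ng[8920]:   notice: initiate_remote_stonith_op: Initiating remote operation reboot for host1: 4c3f947b-12a7-4b6f-84a9-c5ddcbeb55c6 (0)
Dec  2 20:02:26 host2 stonith-ng[8920]:    error: remote_op_done: Operation reboot of host1 by host2 for crmd.8924@host2.4c3f947b: Timer expired
"""

START = re.compile(
    r"^(\w+\s+\d+ [\d:]+) .*initiate_remote_stonith_op: .* for (\S+): ([0-9a-f]{8})-"
)
END = re.compile(r"^(\w+\s+\d+ [\d:]+) .*remote_op_done: .*\.([0-9a-f]{8}): (.+)$")


def parse_ts(text, year):
    # Collapse the double space syslog uses for single-digit days.
    return datetime.strptime(f"{year} " + " ".join(text.split()), "%Y %b %d %H:%M:%S")


def fencing_attempts(log_text, year=2013):
    """Return (target, short op id, result, elapsed seconds) per attempt."""
    started = {}
    attempts = []
    for line in log_text.splitlines():
        m = START.match(line)
        if m:
            started[m.group(3)] = (parse_ts(m.group(1), year), m.group(2))
            continue
        m = END.match(line)
        if m and m.group(2) in started:
            begun, target = started.pop(m.group(2))
            elapsed = (parse_ts(m.group(1), year) - begun).total_seconds()
            attempts.append((target, m.group(2), m.group(3).strip(), elapsed))
    return attempts


for attempt in fencing_attempts(SAMPLE):
    print(attempt)
```

Every attempt in the log follows the same pattern: a reboot of host1 is
initiated, it ends 72 seconds later with "Timer expired", and a new
transition immediately schedules the same fencing operation again.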

So is there something new about 1.1.10 that I am missing?

Cheers,
b.


_______________________________________________
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org
