Re: [Linux-HA] Quick 'death match cycle' question.

Digimer Mon, 02 Sep 2013 21:04:45 -0700

On 02/09/13 23:23, Alex Sudakar wrote:

> I've got a very simple question which I suspect betrays my lack of
> understanding of something basic.  Could someone help me understand?
>
> If I have a two-node Pacemaker cluster - say, a really simple cluster
> of two nodes, A & B, with a solitary network connection between them -
> then I have to set no-quorum-policy to 'ignore'.  If the network
> connection is broken then both A & B will attempt to STONITH each
> other.
>
> Is there anything that would stop an endless cycle of each killing the
> other if the actions of the STONITH agents are set to reboot?
>
> I.e.:
>
> -  A & B race to STONITH each other
> -  A kills B
> -  A assumes resources
>
> -  B reboots
> -  B can't see A
> -  B kills A
> -  B assumes resources
>
> -  A reboots
> -  A can't see B
> -  A kills B
> -  A assumes resources
>
> ... etc.
>
> It's to stop this sort of cycle that I've set my STONITH actions to
> 'off' rather than 'reboot'.
>
> But I was reading the 'Fencing topology' document that Digimer
> referenced and I was reminded in my perusal that many people/clusters
> use a 'reboot' action.
>
> For a simple quorum-less cluster of two nodes how do those clusters
> avoid a never-ending cycle of each node killing the other, if neither
> node can 'see' the other via corosync?
>
> It's a very basic question; I think I'm forgetting something obvious.
> Thanks for any help!


There are two parts to this problem:

1. Both call a fence, both might win (dual fence) because both can get their fence call started before they die. This is because the fence devices, the IPMI BMCs, are independent.

2. A fence loop is where a fenced node boots, fences the other node. The other node boots, fences the first. Wash, rinse, repeat.

To solve problem 1, you can set a delay against one of the nodes. Say you set the fence primitive for node 1 to have 'delay="15"'. When node 1 goes to fence node 2, it starts immediately. When node 2 starts to fence node 1, it sees the 15 second delay and pauses. Node 1 will power off node 2 long before node 2 finishes the pause. You can further help this problem by disabling acpid on the nodes. Without it, the power-off signal from the BMC will be nearly instant, shortening up the window where both nodes can initiate a fence.
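
The timing argument above can be sketched as a toy model. This is not Pacemaker code; the function name, its arguments and the example numbers (a 4 second and a 20 second power-off) are made up for illustration. It assumes, as described above, that a fence call completes once it has been started, and that a node which is powered off before its delay expires never starts its call.

```python
def fence_race(delay_on_node1, delay_on_node2, power_off_seconds):
    """Toy model of two nodes fencing each other at t=0.

    delay_on_nodeX is the delay set on the fence primitive that
    targets node X, so the *peer* waits that long before calling.
    Returns the set of nodes that end up powered off.
    """
    node1_calls_at = delay_on_node2   # node 1 uses the primitive for node 2
    node2_calls_at = delay_on_node1   # node 2 uses the primitive for node 1

    fenced = set()
    if node1_calls_at <= node2_calls_at:
        # Node 1 calls first, so node 2 is certain to be powered off.
        fenced.add("node2")
        node2_dies_at = node1_calls_at + power_off_seconds
        # Node 2 only gets its call in if it is still alive at that time.
        if node2_calls_at < node2_dies_at:
            fenced.add("node1")
    else:
        # Mirror case: node 2 calls first.
        fenced.add("node1")
        node1_dies_at = node2_calls_at + power_off_seconds
        if node1_calls_at < node1_dies_at:
            fenced.add("node2")
    return fenced


# No delay anywhere: both calls start at once, so both nodes go down.
print(fence_race(0, 0, 4))
# 15 second delay on node 1's primitive, quick power-off: only node 2 dies.
print(fence_race(15, 0, 4))
# Same delay, but a slow power-off that outlasts the delay.
print(fence_race(15, 0, 20))
```

The last case is why a quick power-off matters: the delay only protects node 1 if node 2 is actually off before the delay runs out.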

To solve problem 2, simply disable corosync/pacemaker from starting on boot. This way, the fenced node will be (hopefully) back up and running, so you can ssh into it and look at what happened. It won't try to rejoin the cluster though, so no risk of a fence loop.
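
To see why this breaks the loop, here is another toy model, again not real cluster code. The function name and the max_reboots cap are hypothetical; the model assumes the network link stays broken the whole time and that node 1 wins the first fence race.

```python
def count_fences(cluster_starts_on_boot, max_reboots=5):
    """Count fence actions while the link between the nodes stays down."""
    survivor, victim = "node1", "node2"
    fences = 1  # node 1 wins the initial race and fences node 2

    for _ in range(max_reboots):
        # The fenced node reboots.
        if not cluster_starts_on_boot:
            # corosync/pacemaker stay off, so the node never tries to
            # rejoin and never fences anyone. It just waits for an admin.
            break
        # The cluster stack starts, the node can't see its peer,
        # so it fences the peer and takes over.
        survivor, victim = victim, survivor
        fences += 1
    return fences


print(count_fences(cluster_starts_on_boot=True))   # keeps going until the cap
print(count_fences(cluster_starts_on_boot=False))  # stops after the first fence
```

With the cluster stack starting on boot the count only stops because of the artificial cap; with it disabled there is exactly one fence.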


digimer

--
Digimer
Papers and Projects: https://alteeve.ca/w/

What if the cure for cancer is trapped in the mind of a person without access to education?
