On 02/09/13 23:23, Alex Sudakar wrote:
I've got a very simple question which I suspect betrays my lack of
understanding of something basic. Could someone help me understand?
If I have a two-node Pacemaker cluster - say, a really simple cluster
of two nodes, A & B, with a solitary network connection between them -
then I have to set no-quorum-policy to 'ignore'. If the network
connection is broken then both A & B will attempt to STONITH each
other.
Is there anything that would stop an endless cycle of each killing the
other if the actions of the STONITH agents are set to reboot?
I.e.:
- A & B race to STONITH each other
- A kills B
- A assumes resources
- B reboots
- B can't see A
- B kills A
- B assumes resources
- A reboots
- A can't see B
- A kills B
- A assumes resources
... etc.
It's to stop this sort of cycle that I've set my STONITH actions to
'off' rather than 'reboot'.
But I was reading the 'Fencing topology' document that Digimer
referenced and I was reminded in my perusal that many people/clusters
use a 'reboot' action.
For a simple quorum-less cluster of two nodes how do those clusters
avoid a never-ending cycle of each node killing the other, if neither
node can 'see' the other via corosync?
It's a very basic question; I think I'm forgetting something obvious.
Thanks for any help!
There are two parts to this problem:
1. A dual fence. Both nodes call a fence and both might win, because each
can get its fence call started before it dies. This is possible because
the fence devices, the IPMI BMCs, are independent.
2. A fence loop. A fenced node boots and fences the other node. The
other node boots and fences the first. Wash, rinse, repeat.
To solve problem 1, you can set a delay against one of the nodes. Say
you set the fence primitive for node 1 to have 'delay="15"'. When node
1 goes to fence node 2, it starts immediately. When node 2 goes to
fence node 1, it sees the 15 second delay and pauses. Node 1 will power
off node 2 long before node 2 finishes the pause. You can reduce the
problem further by disabling acpid on the nodes. Without it, the
power-off signal from the BMC takes effect nearly instantly, shortening
the window in which both nodes can initiate a fence.
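
To make the timing argument concrete, here is a minimal Python sketch of the race. It is only a toy model, not anything Pacemaker or the fence agents actually run. The function name fence_race, the 4 second power-off latency used to stand in for a node with acpid running, and the reaction_lag values are all assumptions made up for illustration. The one rule taken from the explanation above is that a fence call, once issued, completes even if the caller dies afterwards.

```python
def fence_race(fence_delay, poweroff_latency, reaction_lag=None):
    """Toy model of two nodes fencing each other after losing contact.

    fence_delay      -- dict: victim name -> delay (s) set on the fence
                        device that powers that victim off
    poweroff_latency -- seconds between a fence call being issued and the
                        victim actually losing power (illustrative value)
    reaction_lag     -- optional dict: node name -> seconds that node
                        takes to notice the peer is gone (default 0)
    Returns the sorted list of nodes still powered on afterwards.
    """
    a, b = list(fence_delay)
    lag = reaction_lag or {a: 0.0, b: 0.0}
    # An attacker waits for the delay configured against its *victim*.
    issue = {a: lag[a] + fence_delay[b], b: lag[b] + fence_delay[a]}
    first, second = sorted((a, b), key=lambda n: issue[n])
    # The earlier caller always gets its fence call out.
    death = {second: issue[first] + poweroff_latency}
    # The later caller only gets its call out if it is still alive.
    # Once issued, a call completes even if the caller then dies.
    if issue[second] < death[second]:
        death[first] = issue[second] + poweroff_latency
    return sorted(n for n in (a, b) if n not in death)


# No delay, slow power-off: both calls get out, both nodes die.
print(fence_race({"node1": 0, "node2": 0}, 4.0))
# 15 s delay protecting node1: node2 is dead before its pause ends.
print(fence_race({"node1": 15, "node2": 0}, 4.0))
```

In this model the "window" is simply poweroff_latency: the second caller escapes being pre-empted only if it issues its call before that much time has passed, which is why a near-instant power-off helps even without a delay.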
To solve problem 2, simply prevent corosync/pacemaker from starting on
boot. This way, the fenced node will (hopefully) come back up, so you
can ssh into it and look at what happened. It won't try to rejoin the
cluster on its own though, so there is no risk of a fence loop.
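
The effect of that one setting on the loop can be sketched with another small Python toy model. Again, this is an illustration under stated assumptions, not real cluster code: simulate_partition, start_cluster_on_boot and max_reboots are hypothetical names, and the model assumes the network stays broken and that node1 won the initial fence race.

```python
def simulate_partition(start_cluster_on_boot, max_reboots=5):
    """Toy model of a two-node cluster whose only network link is down.

    Assumes node1 won the first fence race. Returns the list of events
    in order, stopping after max_reboots reboots at most.
    """
    events = ["node1 fences node2"]
    active, fenced = "node1", "node2"
    for _ in range(max_reboots):
        events.append(f"{fenced} reboots")
        if not start_cluster_on_boot:
            # The cluster stack is not started, so the rebooted node
            # never looks for its peer and never calls a fence.
            events.append(f"{fenced} stays out of the cluster")
            break
        # The cluster stack starts, cannot see the peer, and fences it.
        events.append(f"{fenced} fences {active}")
        active, fenced = fenced, active
    return events


for line in simulate_partition(start_cluster_on_boot=False):
    print(line)
```

With start_cluster_on_boot=True the same function keeps alternating fences until max_reboots runs out, which is the cycle described in the question.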
digimer
--
Digimer
Papers and Projects: https://alteeve.ca/w/
What if the cure for cancer is trapped in the mind of a person without
access to education?
_______________________________________________
Linux-HA mailing list
[email protected]
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems