Hi all,

I did some fencing tests in a two-node cluster; here are some details of my 
setup:

- I use stonith external/ilo for fencing (it connects to the iLO board via 
ssh and issues a reset command)
- both nodes are connected via two bridged ethernet interfaces to two 
redundant switches; the iLO boards are each connected to one of the 
switches
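
For reference, the stonith resource for such a setup might look roughly like 
the CIB fragment below. This is only a sketch: the parameter names (hostlist, 
ilo_hostname, ilo_user, ilo_password) and values are illustrative, and the 
exact names your plugin version expects can be listed with 
"stonith -t external/ilo -n".

```xml
<!-- Sketch only: a stonith resource for fencing node1 via its iLO board.
     Parameter names depend on the external/ilo plugin version; verify
     them with "stonith -t external/ilo -n". -->
<primitive id="stonith-node1" class="stonith" type="external/ilo">
  <instance_attributes id="stonith-node1-ia">
    <attributes>
      <nvpair id="stonith-node1-host" name="hostlist"     value="node1"/>
      <nvpair id="stonith-node1-ilo"  name="ilo_hostname" value="node1-ilo"/>
      <nvpair id="stonith-node1-user" name="ilo_user"     value="Administrator"/>
      <nvpair id="stonith-node1-pass" name="ilo_password" value="secret"/>
    </attributes>
  </instance_attributes>
</primitive>
```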

My first observation:
- when I removed the network cables from the node that was the DC at the 
time, it took at least three minutes until it decided to stonith the other 
node and to start the resources that had been running on the node without 
network connectivity
- when I removed the network cables from the node that was not the DC, it 
took only about 20 seconds before this node fenced the DC and then became 
DC itself

Why is there such a difference? In my eyes the first case takes far too long 
to detect the outage, but I hope there are timeout values that I can tweak. 
Which ones should I look at?
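
For what it's worth, the heartbeat-layer timeouts live in /etc/ha.d/ha.cf. A 
minimal sketch, with example values only:

```
# /etc/ha.d/ha.cf -- example values only, tune for your network
keepalive 1      # send a heartbeat every second
warntime 5       # log a warning if a heartbeat is this late
deadtime 10      # declare the peer dead after 10s of silence
initdead 60      # longer grace period right after startup
```

These only control how quickly a peer is declared dead; the extra minutes on 
the DC side in case 1 may instead come from a CRM/tengine transition timeout, 
so the crmd/tengine logs around that point are probably worth a closer look.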

I also noticed the following line in the logfile of the DC in the first 
case:
tengine: ... info: extract_event: Stonith/shutdown of <uuid> not matched
This line shows up immediately after the DC detects that the other node is 
unreachable. From then on, it takes at least two more minutes until the DC 
decides to fence the other node.


The second thing I observed:
My stonith works by connecting via ssh to the iLO board of the node that is 
to be fenced. When I remove the ethernet cables from one node, that node can 
therefore no longer reach the other node's iLO board, and its fencing 
attempt fails.

Take case two from above (cables removed from the node that is not the DC), 
where I observed the following:
The DC needed a few minutes to decide to fence the other node, because of 
the behaviour described above. Meanwhile, the non-DC node without network 
cables tried to fence the DC; that attempt failed, and the node remained in 
an unclean state until the DC finally fenced it.
It was actually lucky that the stonith of the DC failed. Now assume that, 
instead of ssh as the stonith transport, a stonith device connected to e.g. 
a serial port were used.
In that case the non-DC node would have been able to fence the DC, become DC 
itself, start all resources, mount all filesystems, ...
Meanwhile the fenced DC reboots. Either heartbeat is not started 
automatically, in which case the cluster is unusable, because the one node 
that is now DC has no network. Or heartbeat is started automatically, cannot 
communicate with the second node, assumes it is dead, and starts all its 
resources, so that e.g. filesystems could end up mounted on both nodes at 
once.

I don't have a hardware fencing device to test this theory, but could that 
happen or not? Could the use of some ping nodes, combined with pingd or an 
external quorumd, help to solve the dilemma?
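
In case it helps the discussion: the pingd approach I have in mind would be 
something like the constraint below, which keeps a resource group off any 
node that has lost connectivity to the ping node. All names, addresses, and 
scores are placeholders, and the pingd path/options may differ on SLES 10.

```xml
<!-- Sketch only. Assumes pingd is running on both nodes, e.g. via ha.cf:
       ping 192.168.1.254
       respawn root /usr/lib/heartbeat/pingd -m 100 -d 5s
     This rule keeps "my_group" off any node where the pingd attribute is
     missing or zero, i.e. a node without connectivity to the ping node. -->
<rsc_location id="my_group-connectivity" rsc="my_group">
  <rule id="my_group-pingd-rule" score="-INFINITY" boolean_op="or">
    <expression id="my_group-pingd-undef" attribute="pingd" operation="not_defined"/>
    <expression id="my_group-pingd-zero"  attribute="pingd" operation="lte" value="0"/>
  </rule>
</rsc_location>
```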

Well, I am running heartbeat 2.1.2-15 on SLES 10 SP1; any hints and comments 
are appreciated.

kind regards
Sebastian



_______________________________________________
Linux-HA mailing list
[email protected]
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems
