On Nov 7, 2007, at 4:43 PM, Sebastian Reitenbach wrote:
Hi all,
I did some fencing tests in a two-node cluster; here are some details of my setup:
- stonith external/ilo is used for fencing (ssh to the iLO board and issue a reset command)
- both nodes are connected via two bridged ethernet interfaces to two redundant switches; the iLO boards are connected to each of the switches.
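(For reference, an external/ilo stonith resource in the heartbeat 2 CRM can be sketched roughly as below; the resource ids and parameter names are illustrative - run `stonith -t external/ilo -n` to see the exact parameters your plugin version expects:)

```xml
<clone id="fencing">
  <primitive id="st-ilo" class="stonith" type="external/ilo">
    <instance_attributes id="st-ilo-attrs">
      <attributes>
        <!-- parameter names vary by plugin version;
             verify with: stonith -t external/ilo -n -->
        <nvpair id="st-ilo-node" name="hostname" value="node1"/>
        <nvpair id="st-ilo-ip"   name="ilo_hostname" value="192.168.0.10"/>
      </attributes>
    </instance_attributes>
  </primitive>
</clone>
```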
My first observation:
- when removing the network cables from the node that is currently the DC, it took at least three minutes until it decided to stonith the other node and to start up the resources that ran on the node without network connectivity
- when removing the network cables from the node that is not the DC, it was a matter of about 20 seconds until this node fenced the DC and then became DC itself
Why is there such a difference? The first case takes too long in my eyes to detect the outage, but I hope there are timeout values that I can tweak. Which ones should I look at?
I see later on you said you can't reproduce this, but I'd really like to see those logs if you still have them.
Also, hb_report can be used after you find a problem - it's not necessary to be able to reproduce it.
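(As far as the timeouts go: failure detection at the messaging layer is governed by the heartbeat timers in /etc/ha.d/ha.cf - the values below are illustrative, not recommendations, and the fencing delay you saw also involves crmd/tengine timers on top of these:)

```
# /etc/ha.d/ha.cf (illustrative values)
keepalive 1      # send a heartbeat packet every second
warntime  5      # log a warning after 5s of silence from a peer
deadtime  15     # declare a peer dead after 15s of silence
initdead  60     # allow extra time for detection at startup
```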
Also I recognized the following line in the logfile from the DC in the first case:
tengine: ... info: extract_event: Stonith/shutdown of <uuid> not matched
This line shows up immediately after the DC detects that the other node is unreachable.
That's the TE noticing the node go away - which is good
From then it takes at least two minutes until the DC decides to fence the other node.
This part - not so good.
The second thing I observed: my stonith works via ssh to the ilo board of the node that shall be fenced. When I remove the ethernet cables from one node, stonith will fail to kill the other node. Take case two from above - remove the cables from the node that is not the DC - where I observed the following:
The DC needs some minutes to decide to fence the other node, because of the behaviour observed above. Meanwhile the non-DC node without network cables tried to fence the DC; that failed, and the node was in an unclean state until the DC fenced it in the end.
Luckily the stonith of the DC failed - but now assume that instead of ssh, a stonith device connected to e.g. a serial port is used as the stonith resource. In that case, the non-DC node would be able to fence the DC and then become DC itself, starting all resources, mounting all filesystems, ...
Meanwhile the DC is restarted. Either heartbeat is not started automatically - then the cluster is unusable, because the one node that is DC has no network - or heartbeat is started automatically, in which case it cannot communicate with the second node and will assume this one is dead,
Actually it won't assume that.
Instead it will try to shoot the other node, and only after that succeeds will it start any resources.
Safe but not very smart (since clearly each side will take turns shooting the other until the fault is repaired).
Which is why 2-node clusters are not a very good idea :-)
In a 3-node cluster the disconnected node won't have quorum and isn't allowed to try and kill anyone.
Alternatively, use stonith-action=poweroff
and start
all its resources, so that e.g. filesystems could be mounted on both
nodes.
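(The stonith-action suggestion above is set as a cluster property in the CIB; a sketch for heartbeat 2's CRM, with illustrative ids:)

```xml
<crm_config>
  <cluster_property_set id="cib-bootstrap-options">
    <attributes>
      <!-- power the fenced node off instead of rebooting it,
           so it cannot come back up and shoot in turn -->
      <nvpair id="opt-stonith-action" name="stonith-action" value="poweroff"/>
    </attributes>
  </cluster_property_set>
</crm_config>
```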
I don't have a hardware fencing device to test my theory, but could that happen or not? Could the usage of some ping nodes, combined with pingd or an external quorumd, help to solve the dilemma?
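(On the pingd idea: the usual heartbeat 2 setup declares ping nodes in ha.cf and respawns pingd, which maintains a node attribute that location constraints can then test - the IP and paths below are illustrative:)

```
# /etc/ha.d/ha.cf (illustrative)
ping 192.168.0.1                                  # e.g. the default router as a ping node
respawn hacluster /usr/lib/heartbeat/pingd -m 100 -d 5s
```

Resources are then tied to connectivity with a location rule on the pingd attribute, e.g. score -INFINITY where the attribute is undefined or 0, so a node without network connectivity cannot run them.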
Well, I am running heartbeat 2.1.2-15 on SLES 10 SP1; any hints and comments are appreciated.
kind regards
Sebastian
_______________________________________________
Linux-HA mailing list
[email protected]
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems