Hi,

On Wed, Nov 07, 2007 at 04:43:32PM +0100, Sebastian Reitenbach wrote:
> Hi all,
>
> I did some fencing tests in a two-node cluster; here are some details of
> my setup:
>
> - use stonith external/ilo for fencing (ssh to the ilo board and issue a
>   reset command)
> - both nodes are connected via two bridged ethernet interfaces to two
>   redundant switches. The ilo boards are connected to each of the
>   switches.
>
> My first observation:
> - when removing the network cables from the node that is the DC at the
>   moment, it took at least three minutes until it decided to stonith the
>   other node and to start up the resources that ran on the node without
>   network connectivity
> - when removing the network cables from the node that is not the DC, it
>   was a matter of e.g. 20 seconds until this node fenced the DC and then
>   became DC
This definitely deserves a set of logs, etc. (is your hb_report
operational? :)

> Why is there such a difference? The first one takes too long in my eyes
> to detect the outage, but I hope there are timeout values that I can
> tweak. Which ones shall I take a look at?

deadtime in ha.cf.

> Also I recognized the following line in the logfile from the DC in the
> first case:
> tengine: ... info: extract_event: Stonith/shutdown of <uuid> not matched
> This line shows up immediately after the DC detects that the other node
> is unreachable. From then it takes at least two minutes until the DC
> decides to fence the other node.

Looks like a kind of misunderstanding between the CRM and stonithd.
Again, a report would hopefully reveal what's going on. If you could turn
debug on, that'd be great. A bugzilla is fine too.

> The second thing I observed:
> My stonith works via ssh to the ilo board of the node that shall be
> fenced. When I remove the ethernet cables from one node, stonith will
> fail to kill the other node.
>
> Take case two from above: remove the cables from the node that is not
> the DC. There I observed the following:
> The DC needs some minutes to decide to fence the other node, because of
> the behaviour observed above. Meanwhile the non-DC node without network
> cables tried to fence the DC; that failed, and the node was in an
> unclean state until the DC fenced it in the end.
> Luckily the stonith of the DC failed. Now assume that instead of ssh, a
> stonith device connected to e.g. a serial port is used as the stonith
> resource.
> In that case, the non-DC node would be able to fence the DC, then become
> DC itself, starting all resources, mounting all filesystems, ...
> Meanwhile the DC is restarted, and either heartbeat is not started
> automatically, in which case the cluster is unusable, because the one
> node that is DC has no network.
> Or when heartbeat is started automatically, it cannot communicate with
> the second node, will assume this one is dead, and will insist on
> resetting it.

Which would result in a yo-yo machinery. Not entirely useful. This kind
of lack of communication is obviously detrimental, and that in spite of
the stonith configured. Right now I don't see a solution to this issue.
Apart from pingd.

> and start all its resources, so that e.g. filesystems could be mounted
> on both nodes.
>
> I don't have a hardware fencing device to test my theory, but could that
> happen or not? Could the usage of some ping nodes, combined with a pingd
> or an external quorumd, help to solve the dilemma?

A pingd resource with appropriate constraints would help, i.e. something
like "don't run resources if the pingd attribute is zero".

> Well, I am running heartbeat 2.1.2-15 on sles10sp1; any hints and
> comments are appreciated.

Thanks,

Dejan

> kind regards
> Sebastian

_______________________________________________
Linux-HA mailing list
[email protected]
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems
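P.S. To illustrate the deadtime suggestion: the relevant timing directives
live together in ha.cf. A minimal sketch — the values are illustrative
starting points, not recommendations, so tune them to your network:

```
# ha.cf -- heartbeat timing (illustrative values, adjust for your setup)
keepalive 1      # send a heartbeat every second
warntime 5       # log a "late heartbeat" warning after 5s of silence
deadtime 10      # declare a peer dead after 10s without heartbeats
initdead 60      # be more tolerant right after startup, while
                 # interfaces and switches are still coming up
```

deadtime is the knob that directly controls how fast a silent peer is
declared dead; keep initdead well above deadtime so nodes are not shot
while the network is still initializing.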
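P.P.S. A sketch of the pingd approach, for heartbeat 2.x: declare one or
more ping nodes in ha.cf (e.g. `ping 192.0.2.1` — address made up here),
run pingd cloned on all nodes, and add a location constraint so resources
leave a node whose pingd attribute is undefined or zero. All ids and the
resource name below are placeholders for illustration; check the exact
syntax against the CIB DTD of your version:

```xml
<!-- Keep my_resource off nodes with no ping-node connectivity.
     "my_resource" and all ids are placeholders, not from your config. -->
<rsc_location id="loc-connected" rsc="my_resource">
  <rule id="loc-connected-rule" score="-INFINITY" boolean_op="or">
    <expression id="e-pingd-undef" attribute="pingd"
                operation="not_defined"/>
    <expression id="e-pingd-zero" attribute="pingd"
                operation="lte" value="0" type="number"/>
  </rule>
</rsc_location>
```

With such a rule, a node that loses contact with all ping nodes scores
-INFINITY for the resource and gives it up, instead of holding on to (or
fencing for) resources it cannot actually serve.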
