Hi,

Dejan Muhamedagic <[EMAIL PROTECTED]> wrote:
> Hi,
>
> On Wed, Nov 07, 2007 at 04:43:32PM +0100, Sebastian Reitenbach wrote:
> > Hi all,
> >
> > I did some fencing tests in a two-node cluster. Here are some details
> > of my setup:
> >
> > - stonith uses external/ilo for fencing (ssh to the iLO board to
> >   issue a reset command)
> > - both nodes are connected via two bridged ethernet interfaces to two
> >   redundant switches; the iLO boards are connected to each of the
> >   switches
> >
> > My first observation:
> > - when removing the network cables from the node that is currently
> >   the DC, it took at least three minutes until it decided to stonith
> >   the other node and to start up the resources that had been running
> >   on the node without network connectivity
> > - when removing the network cables from the node that is not the DC,
> >   it took only about 20 seconds until this node fenced the DC and
> >   then became DC itself
>
> This definitely deserves a set of logs, etc. (is your hb_report
> operational? :)

humm, yes, with the latest patches (:
ok, I'll reproduce the problem and create a report.
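[For readers of the archive: a stonith resource for the external/ilo plugin in the heartbeat 2.x CIB might be configured roughly as sketched below. This is only an illustration; the parameter names are from memory and should be verified against the plugin itself, e.g. with "stonith -t external/ilo -n", and all ids and values are placeholders.]

```xml
<!-- Sketch of a stonith primitive for fencing node1 via its iLO board.
     Parameter names (hostname, ilo_hostname, ilo_user, ilo_password)
     are assumptions; check them with "stonith -t external/ilo -n".
     All ids and values below are placeholders. -->
<primitive id="stonith-ilo-node1" class="stonith" type="external/ilo"
           provider="heartbeat">
  <instance_attributes id="stonith-ilo-node1-ia">
    <attributes>
      <nvpair id="ia-hostname"     name="hostname"     value="node1"/>
      <nvpair id="ia-ilo-hostname" name="ilo_hostname" value="node1-ilo"/>
      <nvpair id="ia-ilo-user"     name="ilo_user"     value="Administrator"/>
      <nvpair id="ia-ilo-password" name="ilo_password" value="secret"/>
    </attributes>
  </instance_attributes>
</primitive>
```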
> > Why is there such a difference? The first case takes too long, in my
> > eyes, to detect the outage, but I hope there are timeout values that
> > I can tweak. Which ones should I look at?
>
> deadtime in ha.cf.
>
> > I also noticed the following line in the log file of the DC in the
> > first case:
> >
> >   tengine: ... info: extract_event: Stonith/shutdown of <uuid> not matched
> >
> > This line shows up immediately after the DC detects that the other
> > node is unreachable. From then on it takes at least two minutes until
> > the DC decides to fence the other node.
>
> Looks like a kind of misunderstanding between the CRM and
> stonithd. Again, a report would hopefully reveal what's going on.
> If you could turn debug on, that'd be great. A bugzilla is
> fine too.

I'll do that, with the above logs attached.

> > The second thing I observed:
> > My stonith works via ssh to the iLO board of the node that shall be
> > fenced. When I remove the ethernet cables from one node, stonith
> > will fail to kill the other node.
> >
> > Take case two from above and remove the cables from the node that is
> > not the DC. I observed the following:
> > The DC needs some minutes to decide to fence the other node, because
> > of the behaviour observed above. Meanwhile the non-DC node without
> > network cables tried to fence the DC; that failed, and the node was
> > in an unclean state until the DC finally fenced it.
> > Luckily the stonith of the DC failed. Now assume that, instead of
> > ssh as the stonith resource, a stonith device connected to e.g. a
> > serial port were used. In that case the non-DC node would be able to
> > fence the DC, become DC itself, start all resources, mount all
> > filesystems, ...
> > Meanwhile the fenced DC is restarted, and either heartbeat is not
> > started automatically, in which case the cluster is unusable,
> > because the one node that is DC has no network.
> > Or heartbeat is started automatically; then it cannot communicate
> > with the second node, will assume that node is dead, and will insist
> > on resetting it.
>
> Which would result in a yo-yo machinery. Not entirely useful. This
> kind of lack of communication is obviously detrimental, and that in
> spite of the stonith configured. Right now I don't see a solution to
> this issue. Apart from pingd.
>
> > and start all its resources, so that e.g. filesystems could be
> > mounted on both nodes.
> >
> > I don't have a hardware fencing device to test my theory, but could
> > that happen or not? Could the usage of some ping nodes, combined
> > with pingd or an external quorumd, help to solve the dilemma?
>
> A pingd resource with appropriate constraints would help, i.e.
> something like "don't run resources if the pingd attribute is
> zero".

I am already fiddling around with pingd, but I don't seem to get it to
work; see the other thread: "problem with locations depending on pingd".

> > Well, I am running heartbeat 2.1.2-15 on sles10sp1; any hints and
> > comments are appreciated.
>
> Thanks,
>
> Dejan

thank you,
Sebastian
_______________________________________________
Linux-HA mailing list
[email protected]
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems
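[Postscript for readers of the archive: the constraint Dejan describes, "don't run resources if the pingd attribute is zero", can be expressed in the heartbeat 2.x CIB roughly as follows. A sketch only: the resource name my_group and all ids are placeholders, and the attribute name must match whatever the pingd resource is configured to set.]

```xml
<!-- Keep my_group off any node whose "pingd" attribute is missing or
     zero, i.e. a node that cannot reach its ping nodes.
     Resource name and ids are placeholders. -->
<rsc_location id="loc-need-connectivity" rsc="my_group">
  <rule id="rule-need-connectivity" score="-INFINITY" boolean_op="or">
    <expression id="expr-pingd-undefined" attribute="pingd"
                operation="not_defined"/>
    <expression id="expr-pingd-zero" attribute="pingd"
                operation="lte" value="0" type="number"/>
  </rule>
</rsc_location>
```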
