Hi,

Dejan Muhamedagic <[EMAIL PROTECTED]> wrote:
> Hi,
>
> On Wed, Nov 07, 2007 at 04:43:32PM +0100, Sebastian Reitenbach wrote:
> > Hi all,
> >
> > I did some fencing tests in a two-node cluster. Here are some details
> > of my setup:
> >
> > - stonith uses external/ilo for fencing (ssh to the iLO board to
> >   issue a reset command)
> > - both nodes are connected via two bridged ethernet interfaces to two
> >   redundant switches; the iLO boards are connected to each of the
> >   switches
> >
> > My first observation:
> > - when removing the network cables from the node that is currently
> >   the DC, it took at least three minutes until it decided to stonith
> >   the other node and to start up the resources that had been running
> >   on the node without network connectivity
> > - when removing the network cables from the node that is not the DC,
> >   it took only about 20 seconds until this node fenced the DC and
> >   then became DC itself
>
> This definitely deserves a set of logs, etc. (is your hb_report
> operational? :)

humm, yes, with the latest patches (:
ok, I'll reproduce the problem and create a report.
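[For readers of the archive: a stonith resource for the external/ilo plugin in the heartbeat 2.x CIB might be configured roughly as sketched below. This is only an illustration; the parameter names are from memory and should be verified against the plugin itself, e.g. with "stonith -t external/ilo -n", and all ids and values are placeholders.]

```xml
<!-- Sketch of a stonith primitive for fencing node1 via its iLO board.
     Parameter names (hostname, ilo_hostname, ilo_user, ilo_password)
     are assumptions; check them with "stonith -t external/ilo -n".
     All ids and values below are placeholders. -->
<primitive id="stonith-ilo-node1" class="stonith" type="external/ilo"
           provider="heartbeat">
  <instance_attributes id="stonith-ilo-node1-ia">
    <attributes>
      <nvpair id="ia-hostname"     name="hostname"     value="node1"/>
      <nvpair id="ia-ilo-hostname" name="ilo_hostname" value="node1-ilo"/>
      <nvpair id="ia-ilo-user"     name="ilo_user"     value="Administrator"/>
      <nvpair id="ia-ilo-password" name="ilo_password" value="secret"/>
    </attributes>
  </instance_attributes>
</primitive>
```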
> > Why is there such a difference? The first case takes too long, in my
> > eyes, to detect the outage, but I hope there are timeout values that
> > I can tweak. Which ones should I look at?
>
> deadtime in ha.cf.
>
> > I also noticed the following line in the log file of the DC in the
> > first case:
> >
> >   tengine: ... info: extract_event: Stonith/shutdown of <uuid> not matched
> >
> > This line shows up immediately after the DC detects that the other
> > node is unreachable. From then on it takes at least two minutes until
> > the DC decides to fence the other node.
>
> Looks like a kind of misunderstanding between the CRM and
> stonithd. Again, a report would hopefully reveal what's going on.
> If you could turn debug on, that'd be great. A bugzilla is
> fine too.

I'll do that, with the above logs attached.

> > The second thing I observed:
> > My stonith works via ssh to the iLO board of the node that shall be
> > fenced. When I remove the ethernet cables from one node, stonith
> > will fail to kill the other node.
> >
> > Take case two from above and remove the cables from the node that is
> > not the DC. I observed the following:
> > The DC needs some minutes to decide to fence the other node, because
> > of the behaviour observed above. Meanwhile the non-DC node without
> > network cables tried to fence the DC; that failed, and the node was
> > in an unclean state until the DC finally fenced it.
> > Luckily the stonith of the DC failed. Now assume that, instead of
> > ssh as the stonith resource, a stonith device connected to e.g. a
> > serial port were used. In that case the non-DC node would be able to
> > fence the DC, become DC itself, start all resources, mount all
> > filesystems, ...
> > Meanwhile the fenced DC is restarted, and either heartbeat is not
> > started automatically, in which case the cluster is unusable,
> > because the one node that is DC has no network.
> > Or heartbeat is started automatically; then it cannot communicate
> > with the second node, will assume that node is dead, and will insist
> > on resetting it.
>
> Which would result in a yo-yo machinery. Not entirely useful. This
> kind of lack of communication is obviously detrimental, and that in
> spite of the stonith configured. Right now I don't see a solution to
> this issue. Apart from pingd.
>
> > and start all its resources, so that e.g. filesystems could be
> > mounted on both nodes.
> >
> > I don't have a hardware fencing device to test my theory, but could
> > that happen or not? Could the usage of some ping nodes, combined
> > with pingd or an external quorumd, help to solve the dilemma?
>
> A pingd resource with appropriate constraints would help, i.e.
> something like "don't run resources if the pingd attribute is
> zero".

I am already fiddling around with pingd, but I don't seem to get it to
work; see the other thread: "problem with locations depending on pingd".

> > Well, I am running heartbeat 2.1.2-15 on sles10sp1; any hints and
> > comments are appreciated.
>
> Thanks,
>
> Dejan

thank you,
Sebastian
_______________________________________________
Linux-HA mailing list
[email protected]
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems
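[Postscript for readers of the archive: the constraint Dejan describes, "don't run resources if the pingd attribute is zero", can be expressed in the heartbeat 2.x CIB roughly as follows. A sketch only: the resource name my_group and all ids are placeholders, and the attribute name must match whatever the pingd resource is configured to set.]

```xml
<!-- Keep my_group off any node whose "pingd" attribute is missing or
     zero, i.e. a node that cannot reach its ping nodes.
     Resource name and ids are placeholders. -->
<rsc_location id="loc-need-connectivity" rsc="my_group">
  <rule id="rule-need-connectivity" score="-INFINITY" boolean_op="or">
    <expression id="expr-pingd-undefined" attribute="pingd"
                operation="not_defined"/>
    <expression id="expr-pingd-zero" attribute="pingd"
                operation="lte" value="0" type="number"/>
  </rule>
</rsc_location>
```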
