Hi,

On Wed, Nov 07, 2007 at 04:43:32PM +0100, Sebastian Reitenbach wrote:
> Hi all,
> 
> I did some fencing tests in a two node cluster, here are some details of my 
> setup:
> 
> - use stonith external/ilo for fencing (ssh to ilo board and issue a reset 
> command)
> - both nodes are connected via two bridged ethernet interfaces to two 
> redundant switches. The ilo boards are connected to each of the 
> switches.
> 
> My first observation:
> - when removing the network cables from the node that is the DC at the 
> moment, it took at least three minutes until it decided to stonith the 
> other node and to start up the resources that ran on the node without 
> network connectivity
> - when removing the network cables from the node that is not the DC, it 
> took only about 20 seconds until this node fenced the DC and then 
> became DC itself

This definitely deserves a set of logs, etc. (is your hb_report
operational? :).

> Why is there such a difference? In my eyes the first one takes too long 
> to detect the outage, but I hope there are timeout values that I can 
> tweak. Which ones should I look at?

deadtime in ha.cf.
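For example, in /etc/ha.d/ha.cf (the values below are only
illustrative; tune them for your network):

```
# /etc/ha.d/ha.cf -- illustrative fragment
keepalive 1      # heartbeat interval in seconds
warntime 5       # warn about late heartbeats after 5 seconds
deadtime 10      # declare a peer dead after 10 seconds of silence
initdead 60      # extra allowance at startup; keep it well above deadtime
```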

> Also I noticed the following line in the logfile of the DC in the first 
> case:
> tengine: ... info: extract_event: Stonith/shutdown of <uuid> not matched
> This line shows up immediately after the DC detects that the other node is 
> unreachable. From then it takes at least two minutes until the DC decides to 
> fence the other node.

Looks like a kind of misunderstanding between the CRM and
stonithd. Again, a report would hopefully reveal what's going on.
If you could turn debug on, that'd be great. A bugzilla is
fine too.
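Turning debug on is also done in ha.cf (illustrative fragment; higher
debug levels are progressively more verbose):

```
# /etc/ha.d/ha.cf -- illustrative
debug 1              # 0 = off; increase for more verbosity
logfacility local0   # route the messages through syslog
```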

> The second thing I observed:
> My stonith works via ssh to the ilo board of the node that shall be 
> fenced. When I remove the ethernet cables from one node, stonith will 
> fail to kill the other node.
> 
> Take case two from above, remove the cables from the node that is not 
> the DC, where I observed the following:
> The DC needs some minutes to decide to fence the other node, because 
> of the behaviour observed above. Meanwhile the non-DC node without 
> network cables tried to fence the DC; that failed, and the node was in 
> an unclean state until the DC fenced it in the end. 
> Luckily, the stonith of the DC failed. Now assume that instead of ssh, 
> the stonith resource uses a stonith device connected to e.g. a serial 
> port. In that case, the non-DC node would be able to fence the DC, and 
> then become DC itself, starting all resources, mounting all filesystems, ...
> Meanwhile the DC is restarted, and either heartbeat is not started 
> automatically, in which case the cluster is unusable, because the one 
> node that is DC has no network; or heartbeat is started automatically, 
> but cannot communicate with the second node, and will assume this one 
> is dead,

and will insist on resetting it, which would result in a yo-yo
machinery. Not entirely useful. This kind of communication loss is
obviously detrimental, in spite of stonith being configured. Right now
I don't see a solution to this issue, apart from pingd.

> and start 
> all its resources, so that e.g. filesystems could be mounted on both nodes.
> 
> I don't have a hardware fencing device to test my theory, but could that 
> happen or not? Could the usage of some ping nodes, combined with a pingd or 
> an external quorumd help to solve the dilemma?

A pingd resource with appropriate constraints would help, i.e.
something like "don't run resources if the pingd attribute is
zero".
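Under heartbeat 2's CRM this could be expressed roughly as follows (all
ids and the resource name are made up; check the exact syntax against
the CRM documentation for your version):

```xml
<rsc_location id="my-resource-needs-connectivity" rsc="my-resource">
  <rule id="my-resource-needs-connectivity-rule" score="-INFINITY"
        boolean_op="or">
    <!-- run nowhere if the pingd attribute is missing ... -->
    <expression id="pingd-not-defined" attribute="pingd"
                operation="not_defined"/>
    <!-- ... or if no ping node is reachable -->
    <expression id="pingd-is-zero" attribute="pingd"
                operation="lte" value="0"/>
  </rule>
</rsc_location>
```

This assumes pingd is running (e.g. started from ha.cf) and keeping the
"pingd" node attribute up to date.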

> Well, I am running heartbeat 2.1.2-15 on sles10sp1, any hints and comments 
> are appreciated.

Thanks,

Dejan

> kind regards
> Sebastian
> 
> 
> 
> _______________________________________________
> Linux-HA mailing list
> [email protected]
> http://lists.linux-ha.org/mailman/listinfo/linux-ha
> See also: http://linux-ha.org/ReportingProblems