Re: [Linux-HA] Heartbeat questions

Max Hofer Mon, 27 Aug 2007 02:18:14 -0700

On Saturday 25 August 2007, Daniel X Moore wrote:
> Hi All,
>
> I have several pressing questions about heartbeat - I've perused the faq
> and wiki but not come up with the answers:
>
>       - is a two node cluster with stonith safe from split-brain,
>         or do we need to configure a ping node or tiebreaker or
>         the like?
I think you misinterpret the purpose of STONITH. STONITH is used to ensure 
that your assumptions about a node's state become real facts.


This means you already ASSUME you know which node is DEAD and you just make 
sure this is REALITY by shooting down the node.

Split brain = heartbeat on both nodes believes the other node is dead, and 
thus each STONITHs the other node to turn the assumed reality into actual 
reality.

The deathmatch is the result of split brain combined with a STONITH 
configuration which works (that's the good news).

Repeat the mantra:

STONITH does NOT save you from split brain problems. 

As always in life, there are different ways to avoid split brain:

a) use redundant connectivity between the nodes, and by redundant I mean 
something like:
- use IP network
- use a serial cable
In this case, if your network goes down, heartbeat still knows which node is 
running which resources.

NOTE: 2 network cables/interfaces using the same network infrastructure are 
NOT redundant connectivity. If a router/switch goes down ---> split brain

b) if you cannot use redundant connectivity (for example if the two nodes are 
located too far apart) you need a 3rd node to build a quorum. This way 
heartbeat decides which node left the cluster based on the number of 
remaining nodes.

Once again: your network design has to make sure that a network failure does 
not interrupt the connectivity between all 3 nodes. If you put all 3 nodes on 
the same LAN switch and the switch dies ---> all 3 nodes are separated ---> 
no node has a quorum ---> who do you shoot down?
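
The quorum idea in b) can be sketched in a few lines of Python. This is only 
an illustration of a plain majority rule, not heartbeat's actual quorum code; 
the function name and the node names are made up for the example.

```python
def quorate_partitions(partitions, cluster_size):
    """Return the partitions that hold a strict majority of the cluster.

    Assumption: quorum is a plain majority vote. Heartbeat's real quorum
    handling may differ; this only illustrates the counting argument.
    """
    return [p for p in partitions if 2 * len(p) > cluster_size]


# 3 nodes, one node cut off: the two remaining nodes keep quorum.
print(quorate_partitions([{"node1", "node2"}, {"node3"}], 3))

# 3 nodes on one dead switch: every node is alone, nobody has quorum.
print(quorate_partitions([{"node1"}, {"node2"}, {"node3"}], 3))

# 2 nodes, link down: neither side has a majority, hence the need
# for redundant links or a 3rd node.
print(quorate_partitions([{"node1"}, {"node2"}], 2))
```

Under this rule at most one partition can ever be quorate, which is why a 
2-node cluster cannot settle the question by counting alone.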

>       - when we have a misconfiguration, we end up with a stonith
>         "deathmatch". Both machines are either being killed or
>         commiting suicide. Disabling stonith requires the CIB to
>         be up, but as soon as it comes up, the machines get killed.
>         Is there any way to disable stonith BEFORE bringing hearbeat
>         up?
I never tested STONITH with hardware so I cannot help you here.

>       - any tips for preventing "deathmatch" situations? Surely after
>         a reboot or two, it's safe to assume that more rebooting isn't
>         going to improve the situation? Perhaps we have bad timeouts
>         set or something?
I think the explanation above helps.

>       - heartbeat seems very keen to kill nodes. What's the rule for
>         when nodes get killed? I would expect that timeout and failure
>         to stop would justify stonithing, but a start/status failure?
STONITH is usually done for:
a) node died: the value of "deadtime" in your ha.cf defines when a node is 
assumed to be dead. Be aware that increasing this value also means it takes 
longer to take over resources in case the node really dies.

b) resource in unmanaged state (failed stop): this behaviour is configurable 
in the CIB, and it has nothing to do with split-brain (you usually have no 
"deathmatch" in this situation).
The keywords to search for in the CIB DTD 
(http://hg.linux-ha.org/dev/file/tip/crm/crm-1.0.dtd) are:
on_fail ( fence )
stonith_enabled (crm_config)
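
To make the "deadtime" rule from a) concrete, here is a minimal Python 
sketch. It assumes a node is declared dead once no heartbeat has been seen 
for longer than deadtime seconds; the function name and the exact comparison 
are my own illustration, not heartbeat's code.

```python
def peer_assumed_dead(last_heartbeat, now, deadtime):
    """Return True once the peer has been silent for longer than deadtime.

    All values are in seconds. Assumption: a strict "greater than"
    comparison; the real implementation may treat the boundary differently.
    """
    return (now - last_heartbeat) > deadtime


# Last heartbeat seen at t=100s, deadtime of 30s (hypothetical values).
print(peer_assumed_dead(100.0, 110.0, 30.0))  # silent for 10s
print(peer_assumed_dead(100.0, 131.0, 30.0))  # silent for 31s
```

The trade-off is visible here: a larger deadtime tolerates longer hiccups, 
but a node that really died is only detected (and its resources taken over) 
after that much time has passed.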

As far as I know (please correct me if I'm wrong) heartbeat works this way:
- stonith_enabled=false ---> default for on_fail for a stop operation is set 
to "block" ---> which means the resource goes into unmanaged state
- stonith_enabled=true ---> default for on_fail for a stop operation is set 
to "fence" (which means STONITH the node where it failed)
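
Written as Python, the defaults described above look roughly like this. It 
only restates my understanding of the behaviour (which may be wrong, see 
above); the function name is made up and is not part of heartbeat or the CRM.

```python
def stop_on_fail(stonith_enabled, configured=None):
    """Return the on_fail action applied when a stop operation fails.

    Assumption: an explicitly configured on_fail wins; otherwise the
    default depends on stonith_enabled as described in the text above.
    """
    if configured is not None:
        return configured
    return "fence" if stonith_enabled else "block"


print(stop_on_fail(stonith_enabled=False))  # resource becomes unmanaged
print(stop_on_fail(stonith_enabled=True))   # node gets STONITHed
print(stop_on_fail(stonith_enabled=True, configured="block"))
```

So with STONITH enabled, setting on_fail explicitly on the stop operation is 
the way to keep a failed stop from shooting the node.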

>
>       - the log is very verbose, but doesn't seem to include useful
>         messages such as "running 'start' for 'X' on node 'Y'" or
>         "killing node 'X' due to failure of 'status' op" etc. Am
>         I missing the significant messages? Any way of filtering the
>         basic set of "big event" messages from the rest?
>
> Hope someone can provide some guidance here. Overall, we're finding
> heartbeat works very well, but the above issues are making life somewhat
> difficult.
>
> cheers


_______________________________________________
Linux-HA mailing list
[email protected]
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems
