On Saturday 25 August 2007, Daniel X Moore wrote:
> Hi All,
>
> I have several pressing questions about heartbeat - I've perused the faq
> and wiki but not come up with the answers:
>
> - is a two node cluster with stonith safe from split-brain,
>   or do we need to configure a ping node or tiebreaker or
>   the like?

I think you misinterpret the use of STONITH. STONITH is used to ensure that your assumptions about a node's state become real facts.
This means you already ASSUME you know which node is DEAD, and you just make sure this is REALITY by shooting the node down.

Split brain = heartbeat on each node believes the other node is dead, and so each node STONITHs the other to turn the assumed state into reality. The deathmatch is the result of split brain combined with a STONITH configuration that works (that's the good news).

Repeat the mantra: STONITH does NOT save you from split brain problems.

As life is, there are different ways to avoid split brain:

a) use redundant connectivity between the nodes. By redundant I mean something like:
   - use IP network
   - use a serial cable
   In this case, if your network goes down, heartbeat still knows which node is running which resources.
   NOTE: 2 network cables/interfaces using the same network infrastructure are NOT redundant connectivity. If a router/switch goes down ---> split brain

b) if you cannot use redundant connectivity (for example if the two nodes are located too far apart) you need a 3rd node to build a quorum. This way heartbeat decides which node left the cluster based on the number of remaining nodes. Once again: your network design has to make sure that a network failure does not interrupt the connectivity between all 3 nodes. If you put all 3 nodes on the same LAN switch and the switch dies ---> all 3 nodes are separated ---> no node has quorum ---> whom do you shoot down?

> - when we have a misconfiguration, we end up with a stonith
>   "deathmatch". Both machines are either being killed or
>   committing suicide. Disabling stonith requires the CIB to
>   be up, but as soon as it comes up, the machines get killed.
>   Is there any way to disable stonith BEFORE bringing heartbeat
>   up?

I have never tested STONITH with real hardware, so I cannot help you here.

> - any tips for preventing "deathmatch" situations? Surely after
>   a reboot or two, it's safe to assume that more rebooting isn't
>   going to improve the situation? Perhaps we have bad timeouts
>   set or something?
I think the explanation above helps.

> - heartbeat seems very keen to kill nodes. What's the rule for
>   when nodes get killed? I would expect that timeout and failure
>   to stop would justify stonithing, but a start/status failure?

STONITH is usually done for:

a) node died: the value of "deadtime" in your ha.cf defines when a node is assumed to be dead. Be aware that increasing this value also means it takes longer to take over resources if the node really dies.

b) resource in unmanaged state (failed stop): this behaviour is configurable in the CIB, but it has nothing to do with split-brain (and you usually have no "deathmatch" in this situation). The keywords to search for in the CIB DTD (http://hg.linux-ha.org/dev/file/tip/crm/crm-1.0.dtd) are:
   on_fail ( fence )
   stonith_enabled (crm_config)

As far as I know (please correct me if I'm wrong), heartbeat works this way:
- stonith_enabled=false ---> the default on_fail for a stop operation is "block" ---> which means the resource goes into unmanaged state
- stonith_enabled=true ---> the default on_fail for a stop operation is "fence" (which means STONITH the node where the stop failed)

> - the log is very verbose, but doesn't seem to include useful
>   messages such as "running 'start' for 'X' on node 'Y'" or
>   "killing node 'X' due to failure of 'status' op" etc. Am
>   I missing the significant messages? Any way of filtering the
>   basic set of "big event" messages from the rest?
>
> Hope someone can provide some guidance here. Overall, we're finding
> heartbeat works very well, but the above issues are making life somewhat
> difficult.
>
> cheers

_______________________________________________
Linux-HA mailing list
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems
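
P.S. To illustrate the quorum reasoning in point b) of the split-brain discussion above, here is a minimal sketch in Python. It is only a simplified majority-vote model and not heartbeat's actual implementation; the function name has_quorum and its parameters are made up for illustration.

```python
def has_quorum(reachable_nodes: int, total_nodes: int) -> bool:
    """Return True if a partition holds a strict majority of the cluster.

    reachable_nodes counts the local node plus every node it can still see.
    Hypothetical helper for illustration only, not a heartbeat API.
    """
    return 2 * reachable_nodes > total_nodes


if __name__ == "__main__":
    # Two-node cluster, the link between the nodes fails:
    # each node sees only itself, so neither side has a majority.
    print("2 nodes, sees 1:", has_quorum(1, 2))

    # Three-node cluster, one node is cut off:
    # the pair keeps quorum, the isolated node does not.
    print("3 nodes, sees 2:", has_quorum(2, 3))
    print("3 nodes, sees 1:", has_quorum(1, 3))
```

With two nodes, neither side of a split can reach a majority, which is why a third node (or redundant links) is needed. With three nodes on one dead switch, every node sees only itself and nobody has quorum.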
