On Mon, Aug 27, 2007 at 11:17:41AM +0200, Max Hofer wrote:
> On Saturday 25 August 2007, Daniel X Moore wrote:
> > Hi All,
> >
> > I have several pressing questions about heartbeat - I've perused the faq
> > and wiki but not come up with the answers:
> >
> > - is a two node cluster with stonith safe from split-brain,
> >   or do we need to configure a ping node or tiebreaker or
> >   the like?
> I think you misinterpret the use of STONITH. STONITH is used to ensure that
> your assumptions about a node's state are turned into real facts.
>
> This means you already ASSUME you know which node is DEAD and you just make
> sure this is REALITY by shooting down the node.
>
> Split brain = heartbeat on each node believes the other node is dead, and
> thus STONITHs the other node to make the assumed reality become reality.
>
> The deathmatch is the result of split brain plus a STONITH configuration
> that works (that's the good news).
>
> Repeat the mantra:
>
> STONITH does NOT save you from split brain problems.
>
> That said, there are different ways to avoid split brain:
>
> a) use redundant connectivity between the nodes, and by redundant I mean
>    something like:
>    - use an IP network
>    - use a serial cable
>    In this case, if your network goes down, heartbeat still knows which
>    node is running which resources.
>
>    NOTE: 2 network cables/interfaces using the same network infrastructure
>    is NOT redundant connectivity. If a router/switch goes down ---> split
>    brain.
>
> b) if you cannot use redundant connectivity (for example if the two nodes
>    are located too far apart) you need a 3rd cluster node to build a
>    quorum. This way heartbeat decides which node left the cluster based on
>    the number of remaining nodes.
>
> Once again: your network design has to make sure that a network failure
> does not interrupt the connectivity between all 3 nodes.
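For option (a), a minimal ha.cf sketch with two independent heartbeat paths; the device names and timing values below are illustrative, not from the original mail:

```
# ha.cf -- two redundant heartbeat paths (illustrative values)
serial  /dev/ttyS0     # serial crossover cable, independent of the LAN
baud    19200
bcast   eth0           # broadcast heartbeat on the IP network
keepalive 2            # seconds between heartbeat packets
deadtime  30           # declare a node dead after 30s of silence
node    node1 node2
```

With both links configured, a dead switch alone no longer looks like a dead peer, so no STONITH is triggered.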
> If you put all 3 nodes on the same LAN switch and the switch dies ---> all
> 3 nodes are separated ---> no node has a quorum ---> whom do you shoot
> down?
>
> > - when we have a misconfiguration, we end up with a stonith
> >   "deathmatch". Both machines are either being killed or
> >   committing suicide. Disabling stonith requires the CIB to
> >   be up, but as soon as it comes up, the machines get killed.
> >   Is there any way to disable stonith BEFORE bringing heartbeat
> >   up?
> I never tested STONITH with hardware so I cannot help you here.
I don't think so. What you could do is move your cib out of
/var/lib/heartbeat/crm and remove the other files in that directory on both
nodes, then start the cluster. Then modify the cib in such a way that stonith
won't work: set the stonith resource's target-role to stopped, or set
stonith-enabled in the cluster_property_set to false. Finally, apply the new
cib: cibadmin -C -x cib.xml. But beware: if the cluster wants to kill the
other node, there must be a good reason for it. Note that your data might be
at stake here.

> > - any tips for preventing "deathmatch" situations? Surely after
> >   a reboot or two, it's safe to assume that more rebooting isn't
> >   going to improve the situation? Perhaps we have bad timeouts
> >   set or something?
> I think the explanation above helps.
>
> > - heartbeat seems very keen to kill nodes. What's the rule for
> >   when nodes get killed? I would expect that timeout and failure
> >   to stop would justify stonithing, but a start/status failure?
> STONITH is usually done for:
> a) node died: the value of "deadtime" in your ha.cf defines when a node is
>    assumed to be dead. Be aware that increasing this value also means it
>    takes longer to take over resources in case the node really dies.
>
> b) resource in unmanaged state (failed stop): this behaviour is
>    configurable via the CIB - but it has nothing to do with split-brain
>    (and you usually have no "deathmatch" in this situation).
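As a sketch of the second option in the procedure above (assuming the heartbeat 2.x CIB schema; the id values are illustrative), the cluster_property_set in the cib.xml you feed to cibadmin -C -x cib.xml would carry:

```xml
<cluster_property_set id="cib-bootstrap-options">
  <attributes>
    <!-- illustrative ids; "stonith-enabled" is the attribute that matters -->
    <nvpair id="opt-stonith-enabled" name="stonith-enabled" value="false"/>
  </attributes>
</cluster_property_set>
```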
> Search for these keywords in the CIB DTD
> (http://hg.linux-ha.org/dev/file/tip/crm/crm-1.0.dtd):
> on_fail ( fence )
> stonith_enabled (crm_config)
>
> As far as I know (please correct me if I'm wrong), heartbeat works this
> way:
> - stonith_enabled=false ---> the default on_fail for a stop operation is
>   "block" ---> which means the resource goes into the unmanaged state
> - stonith_enabled=true ---> the default on_fail for a stop operation is
>   "fence" (which means STONITH the node where the stop failed)
>
> > - the log is very verbose, but doesn't seem to include useful
> >   messages such as "running 'start' for 'X' on node 'Y'" or
> >   "killing node 'X' due to failure of 'status' op" etc. Am
> >   I missing the significant messages? Any way of filtering the
> >   basic set of "big event" messages from the rest?

Yes, the log is too verbose.

I wondered about messages like "running 'start' for 'X' on node 'Y'" too.
However, it seems that people are annoyed by such messages. Note that a
repeating monitor operation would produce such a message every now and then.
On the other hand, we could include a message for all non-recurring
operations.

I think that the node trying to stonith another one should probably say so,
and in particular it should say if it failed. Are you sure that there are no
messages reporting that?

The only way you can filter is based on the severity level.

> > Hope someone can provide some guidance here. Overall, we're finding
> > heartbeat works very well, but the above issues are making life somewhat
> > difficult.
> >
> > cheers
>
> _______________________________________________
> Linux-HA mailing list
> [email protected]
> http://lists.linux-ha.org/mailman/listinfo/linux-ha
> See also: http://linux-ha.org/ReportingProblems
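To follow up on filtering by severity level: a minimal sketch, assuming heartbeat writes to /var/log/ha-log (adjust to your ha.cf "logfile" or syslog setup) in the usual "heartbeat[pid]: timestamp LEVEL: message" form:

```shell
# Keep only the "big event" lines -- WARN and above -- from heartbeat's log.
# /var/log/ha-log is an assumption; use whatever your logging config
# actually writes to.
LOG=${LOG:-/var/log/ha-log}
grep -E '(WARN|ERROR|CRIT):' "$LOG"
```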
