Hi,

On Wed, Dec 16, 2009 at 08:56:00AM +0100, Alain.Moulle wrote:
> Hi Dejan, and thanks for the responses,
> yet several remarks below ...
> Alain
>
> > Hi,
> >
> > > I'm trying to clearly evaluate the risk of split brain and the risk
> > > of dual fencing with pacemaker/openais in the case where I can't
> > > choose anything else but having only *one* network for
> >
> > Oops.
> >
> > > the totem protocol:
> > >
> > > Let's say we have a two-node cluster with stonith resources:
> > > - if there is a problem on one node (not a network problem):
> > >   the other will become DC (if not already) and fence the failing
> > >   node.
> > > - if there is a network failure between one node and the eth switch:
> > >   neither node gets a token from the other any more, but only the
> > >   DC has the right to take a decision in the cluster, and
> > >   specifically the decision to fence the other node, so the DC node
> > >   should fence the other. The only problem I can see here is if the
> > >   "not-DC" node declares itself the new DC before being fenced, and
> > >   therefore also decides to fence the other node, which could lead
> > >   to a dual-fencing situation. So the fence request from the initial
> > >   DC node should happen before the DC Deadtime value (default 60s)
> > >   to eliminate any risk of dual fencing.
> >
> > Have you ever tried this? If that indeed makes the non-DC node
> > wait with fencing, then that may help.
>
> No, it's my "on paper" understanding, but I'll try ...
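[As an editorial aside, not part of the original exchange: the "DC Deadtime" discussed above is the Pacemaker cluster property dc-deadtime, which can be inspected or set with the crm shell. A minimal sketch, assuming the crm shell is installed on a cluster node; the 60s value is only illustrative:]

```shell
# Illustrative only: show the current cluster properties
# (dc-deadtime among them, if it has been set explicitly).
crm configure show

# Raise the time a node waits before declaring itself the new DC:
crm configure property dc-deadtime=60s
```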
OK. That deadtime may be skipped since crmd knows that the other node
is not reachable.

> > > And if we have a more-than-two-node cluster, it seems similar to
> > > me ...
> >
> > No, because the partition without quorum can't fence nodes. That
> > makes things simpler and more predictable.
>
> ... what if no-quorum-policy=ignore ?

Why would you want to set it to ignore if you have more than two nodes?

> > > Am I right about all this? Or did I miss something somewhere?
> >
> > I'm not sure if my response helps at all. You should test this
> > thoroughly. For instance, we have one bugzilla open for
> > external/ipmi where nodes did shoot each other on split brain.
>
> Could I have the bugzilla number?

http://developerbugs.linux-foundation.org/show_bug.cgi?id=2071

> It's not really easy to test whether we can get dual fencing in case
> of network failure. For example, I used to work with Cluster Suite for
> several years, in two-node mode, and with no quorum-disk functionality
> (it did not work well in the beginning). In that case, there is a race
> to fence between both nodes (no DC notion in CS), and RH always said
> that the probability of dual fencing in case of a heartbeat network
> problem is near 0, but not 0.

Right. It depends on the size of the window between the fencing
request reaching the stonith plugin and the plugin actually killing a
node. If the two windows overlap, you have a problem. Obviously, the
larger the window, the higher the probability.

> OK fine, but I have some big customer sites where I have hundreds of
> HA pairs, and on these sites, despite the probability being near 0, it
> has happened several times; not many, but several.

With which plugin? Did you file a bugzilla? Or was it with RHCS?
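[Editorial sketch, not from the original thread: the overlapping-windows argument above can be made concrete with a toy Monte Carlo model. The timing model is an assumption for illustration only, not Pacemaker's or RHCS's actual behaviour: each node issues its fencing request at a random instant, and `window` is the time from the request reaching the stonith plugin to the peer actually being killed; dual fencing occurs when the two [start, start + window] intervals overlap, i.e. neither kill completes before the other request is already in flight.]

```python
import random


def dual_fence_probability(window, spread, trials=100_000, seed=1):
    """Estimate the probability of two nodes fencing each other.

    Toy model (an assumption, not actual cluster timing): each node's
    fencing request starts at a time uniform in [0, spread) and takes
    `window` seconds to complete.  Dual fencing happens when the two
    [start, start + window] intervals overlap.
    """
    rng = random.Random(seed)
    dual = 0
    for _ in range(trials):
        a = rng.uniform(0, spread)
        b = rng.uniform(0, spread)
        # Overlap test: each request starts before the other completes.
        if a < b + window and b < a + window:
            dual += 1
    return dual / trials
```

As the quoted discussion suggests, a short window relative to the spread of request times gives a near-zero (but non-zero) probability, and the probability grows quickly with the window size.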
> So, we can't really test this dual-fencing risk; I think we have to
> rely on the on-paper behaviour only for this specific case, and try
> to find the configuration which avoids dual fencing for sure, and
> also avoids shared resources being mounted on both sides,

That won't happen, but if there's a high probability of nodes shooting
each other, then that may lead to reduced availability.

Thanks,

Dejan

> that's what I'm trying to find with Pacemaker & openais.
>
> Thanks
> Alain Moullé
>
> _______________________________________________
> Linux-HA mailing list
> [email protected]
> http://lists.linux-ha.org/mailman/listinfo/linux-ha
> See also: http://linux-ha.org/ReportingProblems
