Hi,

On Wed, Dec 16, 2009 at 08:56:00AM +0100, Alain.Moulle wrote:
> Hi Dejan, and thanks for the responses,
> yet several remarks below ...
> Alain
>
> > Hi,
> >
> > > I'm trying to clearly evaluate the risk of split brain and the risk
> > > of dual fencing with pacemaker/openais in the case where I can't
> > > choose anything else but having only *one* network for
> >
> > Oops.
> >
> > > the totem protocol:
> > >
> > > Let's say we have a two-node cluster with stonith resources:
> > > - if there is a problem on one node (not a network problem):
> > >   the other will become DC (if not already) and fence the failing
> > >   node.
> > > - if there is a network failure between one node and the eth switch:
> > >   neither node gets a token from the other any more, but only the
> > >   DC has the right to take a decision in the cluster, and
> > >   specifically the decision to fence the other node, so the DC node
> > >   should fence the other. The only problem I can see here is if the
> > >   "not-DC" node declares itself the new DC before being fenced, and
> > >   therefore also decides to fence the other node, which could lead
> > >   to a dual-fencing situation. So the fence request from the initial
> > >   DC node should happen before the DC Deadtime value (default 60s)
> > >   to eliminate any risk of dual fencing.
> >
> > Have you ever tried this? If that indeed makes the non-DC node
> > wait with fencing, then that may help.
>
> No, it's my "on paper" understanding, but I'll try ...
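[As an editorial aside, not part of the original exchange: the "DC Deadtime" discussed above is the Pacemaker cluster property dc-deadtime, which can be inspected or set with the crm shell. A minimal sketch, assuming the crm shell is installed on a cluster node; the 60s value is only illustrative:]

```shell
# Illustrative only: show the current cluster properties
# (dc-deadtime among them, if it has been set explicitly).
crm configure show

# Raise the time a node waits before declaring itself the new DC:
crm configure property dc-deadtime=60s
```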
OK. That deadtime may be skipped since crmd knows that the other node
is not reachable.

> > > And if we have a more-than-two-node cluster, it seems similar to
> > > me ...
> >
> > No, because the partition without quorum can't fence nodes. That
> > makes things simpler and more predictable.
>
> ... what if no-quorum-policy=ignore ?

Why would you want to set it to ignore if you have more than two nodes?

> > > Am I right about all this? Or did I miss something somewhere?
> >
> > I'm not sure if my response helps at all. You should test this
> > thoroughly. For instance, we have one bugzilla open for
> > external/ipmi where nodes did shoot each other on split brain.
>
> Could I have the bugzilla number?

http://developerbugs.linux-foundation.org/show_bug.cgi?id=2071

> It's not really easy to test whether we can get dual fencing in case
> of network failure. For example, I used to work with Cluster Suite for
> several years, in two-node mode, and with no quorum-disk functionality
> (it did not work well in the beginning). In that case, there is a race
> to fence between both nodes (no DC notion in CS), and RH always said
> that the probability of dual fencing in case of a heartbeat network
> problem is near 0, but not 0.

Right. It depends on the size of the window between the fencing
request reaching the stonith plugin and the plugin actually killing a
node. If the two windows overlap, you have a problem. Obviously, the
larger the window, the higher the probability.

> OK fine, but I have some big customer sites where I have hundreds of
> HA pairs, and on these sites, despite the probability being near 0, it
> has happened several times; not many, but several.

With which plugin? Did you file a bugzilla? Or was it with RHCS?
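[Editorial sketch, not from the original thread: the overlapping-windows argument above can be made concrete with a toy Monte Carlo model. The timing model is an assumption for illustration only, not Pacemaker's or RHCS's actual behaviour: each node issues its fencing request at a random instant, and `window` is the time from the request reaching the stonith plugin to the peer actually being killed; dual fencing occurs when the two [start, start + window] intervals overlap, i.e. neither kill completes before the other request is already in flight.]

```python
import random


def dual_fence_probability(window, spread, trials=100_000, seed=1):
    """Estimate the probability of two nodes fencing each other.

    Toy model (an assumption, not actual cluster timing): each node's
    fencing request starts at a time uniform in [0, spread) and takes
    `window` seconds to complete.  Dual fencing happens when the two
    [start, start + window] intervals overlap.
    """
    rng = random.Random(seed)
    dual = 0
    for _ in range(trials):
        a = rng.uniform(0, spread)
        b = rng.uniform(0, spread)
        # Overlap test: each request starts before the other completes.
        if a < b + window and b < a + window:
            dual += 1
    return dual / trials
```

As the quoted discussion suggests, a short window relative to the spread of request times gives a near-zero (but non-zero) probability, and the probability grows quickly with the window size.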
> So, we can't really test this dual-fencing risk; I think we have to
> rely on the on-paper behaviour only for this specific case, and try
> to find the configuration which avoids dual fencing for sure, and
> also avoids shared resources being mounted on both sides,

That won't happen, but if there's a high probability of nodes shooting
each other, then that may lead to reduced availability.

Thanks,

Dejan

> that's what I'm trying to find with Pacemaker & openais.
>
> Thanks
> Alain Moullé
>
> _______________________________________________
> Linux-HA mailing list
> [email protected]
> http://lists.linux-ha.org/mailman/listinfo/linux-ha
> See also: http://linux-ha.org/ReportingProblems
