Correction. I meant to say that split-brain detection should be done when the nodes see each other again (even at the network level). CRM status messages do start moving again once the connection between the nodes is back, but each node refuses to accept messages from the other.
-----Original Message-----
From: Kettunen Janne ENFO
Sent: 25. tammikuuta 2008 8:43
To: 'General Linux-HA mailing list'
Subject: RE: [Linux-HA] what to do on loss of network

Hi. I have a SLES10 SP1 HA 2.0.8 split-site two-node cluster, and I've configured a pingd clone resource to drive resource location constraints. It works very well. My ping node is an iSCSI server at a third site, from which the cluster node mounts its resource disk. If I disconnect all communication paths between the nodes, the active node correctly stops its resource, because it also loses the ping-node connection. But there is definitely a split brain going on (both nodes think they are the DC). I have tried to figure out how the nodes could detect this split-brain state and fence themselves, but have not succeeded.

Regards ...

-----Original Message-----
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Jonas Andradas
Sent: 25. tammikuuta 2008 0:30
To: General Linux-HA mailing list
Subject: Re: [Linux-HA] what to do on loss of network

Hello,

I think that if you switched to version 2 with the CRM and used pingd instead of ipfail, your problem could be solved: pingd gives a score depending on how many "ping nodes" are reachable. You could then set constraints such that if "pingd_score" is 0 (or below a minimum number of reachable ping nodes), the resources are *not* to run.

Best regards,
Jonas.

2008/1/24, Steve Wray <[EMAIL PROTECTED]>:
> Dejan Muhamedagic wrote:
> > Hi,
> >
> > On Fri, Jan 25, 2008 at 09:03:01AM +1300, Steve Wray wrote:
> >> Forgive top posting, but I just noted this in some documentation:
> >>
> >> "Provided both HA nodes can communicate with each other, ipfail can
> >> reliably detect when one of their network links has become unusable, and
> >> compensate."
> >>
> >> In the example which I give this is not the case; the loss of connectivity
> >> is complete. The nodes cannot communicate with one another.
> >
> > That's called split brain. Not a very nice thing for clusters.
> > Definitely to be avoided.
> > See http://www.linux-ha.org/SplitBrain
> >
> >> One of the nodes can still contact its 'ping' node but not the other node
> >> in the cluster. It is still on the network and can still provide NFS
> >> service.
> >>
> >> The other node cannot contact its 'ping' node and also cannot contact the
> >> other node in the cluster. It is not on the network at all. It has a dead
> >> network connection.
> >>
> >> I need the node with *zero* connectivity to *not* take over as the
> >> active node, as this makes no sense at all; it's not on the network, so it is
> >> pointless bringing up NFS. It should just sit and wait for connectivity to
> >> be restored, and do nothing but monitor the state of its network connection.
> >
> > Neither node knows what is happening on the other side. So they
> > both consider the other node dead (that's the two-node
> > cluster quorum policy, which could be ensured to be sane if you
> > had stonith configured)
>
> I don't want the other node 'shot in the head', though. The network failure
> could be transitory, and when it comes back I don't want to have to
> manually restart the 'shot' server.
>
> > and try to acquire resources. ipfail
> > could ask the node to relinquish all resources in case of no
> > connectivity, but it doesn't, probably because nobody ever needed
> > such a thing.
>
> In this instance the two servers are on a test bed, both running on the
> same Xen host and connected via the Xen bridge.
>
> On the production setup there are two physical hosts, and these are
> connected via a crossover cable. A duplicate of each test-bed virtual
> machine runs on each of the two physical hosts.
>
> There's also no way to use a serial connection in this setup.
>
> The only possible channel of communication between the two nodes is
> a single network connection.
>
> This network connection could fail on one or both nodes.
>
> If it fails, then any node which cannot reach the network should just
> 'sit down and shut up' until the network is restored (not 'turn off', which
> is what I read as implied by stonith).
>
> Can stonith be used to induce a transient shutdown? E.g. turn off
> heartbeat and wait for the network to come back, at which time turn
> heartbeat back on.
>
> > Thanks,
> >
> > Dejan
> >
> >> Steve Wray wrote:
> >>> Dejan Muhamedagic wrote:
> >>>> Hi,
> >>>>
> >>>> On Thu, Jan 24, 2008 at 09:39:05AM +1300, Steve Wray wrote:
> >>>>> Well, I posted my config and I've tried various things and tested this
> >>>>> setup... and it still behaves incorrectly: going primary in the event of
> >>>>> a complete loss of network connectivity.
> >>>>>
> >>>>> I mean... it's an NFS server... a *network* filesystem. If it can't connect
> >>>>> to the network *at* *all*, it makes no sense to become the primary NFS
> >>>>> server...
> >>>>>
> >>>>> I'd really appreciate some comment on what may be wrong in the config
> >>>>> files that I've posted. If there's any further info that I need to post,
> >>>>> please mention it.
> >>>> Did you check if ipfail is running? If not, then you have to
> >>>> check the user in the respawn line. Otherwise, please post the
> >>>> logs.
> >>> Thanks for your reply!
> >>> ipfail is running; the user in the respawn line is correct.
> >>> I just ran a test failure of the network interface on the non-primary
> >>> node. Here are the logs from this test run, only from the 'failed' node.
> >>> ipfail determines that "We are dead" and then heartbeat decides to take
> >>> over as primary.
> >>> Could this be a problem with "/etc/ha.d/rc.d/status status"?
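[For reference, the pingd mechanism discussed in this thread is enabled in Heartbeat's ha.cf. A minimal sketch, assuming Heartbeat 2.x paths; the ping-node address is illustrative, not from the original configs:]

```
# /etc/ha.d/ha.cf (excerpt) -- sketch only; 10.0.0.1 stands in for the real ping node
ping 10.0.0.1
respawn hacluster /usr/lib/heartbeat/pingd -m 100 -d 5s
```

[Here `-m 100` multiplies the count of reachable ping nodes into the `pingd` node attribute, and `-d 5s` dampens flapping; location constraints in the CIB can then key off that attribute.]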
> >>> ------------------------------------------------------------------------
> >>> _______________________________________________
> >>> Linux-HA mailing list
> >>> [email protected]
> >>> http://lists.linux-ha.org/mailman/listinfo/linux-ha
> >>> See also: http://linux-ha.org/ReportingProblems
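[The constraint Jonas suggests could be expressed roughly as follows in Heartbeat 2.x CRM XML. This is a sketch, not the poster's actual config: the resource name `nfs_group` and all ids are hypothetical, and the `pingd` attribute name assumes pingd's default:]

```xml
<rsc_location id="nfs_needs_connectivity" rsc="nfs_group">
  <!-- forbid the resource on any node with no reachable ping nodes -->
  <rule id="nfs_no_ping" score="-INFINITY" boolean_op="or">
    <expression id="pingd_undefined" attribute="pingd" operation="not_defined"/>
    <expression id="pingd_zero" attribute="pingd" operation="lte" value="0"/>
  </rule>
</rsc_location>
```

[The `not_defined` test matters: on a node where pingd has never reported, the attribute is absent rather than zero, and without that clause the rule would not fire there.]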
