Hello,

I think that if you switched to version 2 with the CRM and used pingd
instead of ipfail, your problem could be solved: pingd sets a score
depending on how many "ping" nodes are reachable.  You could then set
constraints such that if the pingd score is 0 (or below a minimum
number of reachable ping nodes), the resources are *not* to run.
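Something along these lines should work (an untested sketch; the ping
node IP, resource name, and all the ids are made up, so adjust them to
your setup).  In ha.cf, define the ping node and respawn pingd with a
multiplier and a dampening delay:

```
ping 10.0.0.1
respawn hacluster /usr/lib/heartbeat/pingd -m 100 -d 5s
```

Then add a location constraint to the CIB that forbids the resource
from running wherever the pingd attribute is undefined or zero:

```xml
<rsc_location id="nfs_on_connected_node" rsc="resource_nfs">
  <rule id="nfs_pingd_exclude" score="-INFINITY" boolean_op="or">
    <!-- no pingd attribute at all: pingd not running on this node -->
    <expression id="nfs_pingd_undefined" attribute="pingd"
                operation="not_defined"/>
    <!-- score 0: none of the ping nodes are reachable -->
    <expression id="nfs_pingd_zero" attribute="pingd"
                operation="lte" value="0" type="number"/>
  </rule>
</rsc_location>
```

With -INFINITY the node with no connectivity can never run the
resource; it just sits there until its pingd score recovers, which is
exactly the behaviour you asked for.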

Best regards,

Jonas.

2008/1/24, Steve Wray <[EMAIL PROTECTED]>:
> Dejan Muhamedagic wrote:
> > Hi,
> >
> > On Fri, Jan 25, 2008 at 09:03:01AM +1300, Steve Wray wrote:
> >> Forgive top posting but I just noted this in some documentation:
> >>
> >> "Provided both HA nodes can communicate with each other, ipfail can
> >> reliably detect when one of their network links has become unusable, and
> >> compensate."
> >>
> >> In the example which I give, this is not the case; the loss of
> >> connectivity is complete. The nodes cannot communicate with one another.
> >
> > That's called split brain. Not a very nice thing for clusters.
> > Definitely to be avoided. See http://www.linux-ha.org/SplitBrain
> >
> >> One of the nodes can still contact its 'ping' node but not the other node
> >> in the cluster. It is still on the network and can still provide NFS
> >> service.
> >>
> >> The other node cannot contact its 'ping' node and also cannot contact the
> >> other node in the cluster. It is not on the network at all. It has a dead
> >> network connection.
> >>
> >> I need the node with *zero* connectivity to *not* take over as the
> >> active node, as this makes no sense at all; it's not on the network, so
> >> it is pointless to bring up NFS. It should just sit and wait for
> >> connectivity to be restored, doing nothing but monitoring the state of
> >> its network connection.
> >
> > Neither node knows what is happening on the other side. So, they
> > both consider that the other node is dead (that's a two-node
> > cluster quorum policy which could be ensured to be sane if you
> > had stonith configured)
>
> I don't want the other node 'shot in the head' though. Network failure
> could be transitory and when it comes back I don't want to have to
> manually restart the 'shot' server.
>
>
> > and try to acquire resources. ipfail
> > could ask the node to relinquish all resources in case of no
> > connectivity, but it doesn't probably because nobody ever needed
> > such a thing.
>
> In this instance the two servers are on a test bed, both running on the
> same Xen host and connected via the Xen bridge.
>
> On the production server there are two physical hosts and these are
> connected via a crossover cable. Each duplicate of the test bed virtual
> machines run on each of the two physical hosts.
>
> There's also no way to use a serial connection in this setup.
>
> The only possible channel of communication between the two nodes is via
> a single network connection.
>
> This network connection could fail on one or both nodes.
>
> If it fails then any node which cannot reach the network should just
> 'sit down and shut up' until network is restored (not 'turn off' which
> is what I read as implied by stonith).
>
> Can stonith be used to induce a transient shutdown? E.g. turn off
> heartbeat and wait for the network to come back, at which time turn
> heartbeat back on.
>
>
> > Thanks,
> >
> > Dejan
> >
> >> Steve Wray wrote:
> >>> Dejan Muhamedagic wrote:
> >>>> Hi,
> >>>>
> >>>> On Thu, Jan 24, 2008 at 09:39:05AM +1300, Steve Wray wrote:
> >>>>> Well I posted my config and I've tried various things and tested this
> >>>>> setup... and it still behaves incorrectly: going primary in the event
> >>>>> of a complete loss of network connectivity.
> >>>>>
> >>>>> I mean... it's an NFS server... a *network* filesystem. If it can't
> >>>>> connect to the network *at* *all*, it makes no sense to become the
> >>>>> primary NFS server...
> >>>>>
> >>>>> I'd really appreciate some comment on what may be wrong in the config
> >>>>> files that I've posted. If there's any further info that I need to post,
> >>>>> please mention it.
> >>>> Did you check if ipfail is running? If not, then you have to
> >>>> check the user in the respawn line. Otherwise, please post the
> >>>> logs.
> >>> Thanks for your reply!
> >>> ipfail is running, the user in the respawn line is correct.
> >>> I just ran a test in which I failed the network interface on the
> >>> non-primary node. Here are the logs from this test run, from the
> >>> 'failed' node only.
> >>> ipfail determines that "We are dead" and then heartbeat decides to take
> >>> over as primary.
> >>> Could this be a problem with "/etc/ha.d/rc.d/status status"?
> >>> ------------------------------------------------------------------------
> >>> _______________________________________________
> >>> Linux-HA mailing list
> >>> [email protected]
> >>> http://lists.linux-ha.org/mailman/listinfo/linux-ha
> >>> See also: http://linux-ha.org/ReportingProblems
>
>
