Hello, if memory serves, the cib.xml file should be located in:
/var/lib/heartbeat/crm/cib.xml

The "base" cib.xml contains just the configuration. During operation,
values are added and modified on the fly, as you say, with the score of
each node, the pingd score, and so on.

Regards,
Jonas Andradas.

On Mon, Jan 28, 2008 at 9:26 PM, Steve Wray <[EMAIL PROTECTED]> wrote:
> Jonas Andradas wrote:
> > Hello,
> >
> > There is a version of heartbeat v2 for Debian etch (apt-cache show
> > heartbeat-2), but it is not the latest heartbeat version, since it is
> > 2.0.7-2, compared to 2.1.3.
>
> I found and installed this more recent version of heartbeat.
>
> > I don't know if the logic you describe can be achieved with v1 and
> > ipfail, even though it is not complex behaviour and is easy to
> > understand, but I am quite sure it can be achieved with pingd and
> > scores in v2. Please, someone correct me if I'm wrong, or suggest
> > another way to address this issue.
>
> I read through:
> http://www.linux-ha.org/pingd
>
> and have configured appropriately except for one thing that I don't
> quite understand; the xml.
>
> I ran this:
> /usr/lib/heartbeat/haresources2cib.py --stdout ha.cf haresources > cib.xml
>
> and the cib.xml file produced is in /etc/ha.d/. I see no documentation
> indicating another location for it.
>
> I edited this xml file and inserted the xml specified in the section
> entitled "Quickstart - Only Run my_resource on Nodes with Access to at
> Least One Ping Node".
>
> However, it's not clear to me that this is the right thing to do...
>
> From what I've read, the xml should be *generated* on the fly and
> shared between the nodes. If that were the case then cib.xml wouldn't
> be under /etc/ha.d/; also, manually editing it would be foolish.
>
> I tested what I have produced and it *kind* of works. Ie:
>
> On the complete failure of network connectivity on the node which was
> secondary at the time of failure, the following sequence takes place:
>
> 1. The secondary node notices loss of connectivity to the primary.
> 2. It becomes primary.
> 3. It notices loss of connectivity to the network as a whole and
> reverts to secondary again.
>
> When networking is restored it works better than before and required
> no manual intervention (previously drbd had to be told what to do by
> an operator).
>
> > Jonas.
> >
> > On Jan 25, 2008 1:18 AM, Steve Wray <[EMAIL PROTECTED]> wrote:
> >
> >> Jonas Andradas wrote:
> >>> Hello,
> >>>
> >>> I think that if you switched to version 2 with CRM and used pingd
> >>> instead of ipfail, your problem could be solved: pingd gives a
> >>> score depending on how many "ping nodes" are reachable. You could
> >>> then set constraints such that if "pingd_score" is 0 (or less than
> >>> a minimum number of reachable ping nodes), the resources are *not*
> >>> to run.
> >> I'll have to look at the possibility of using v2. I don't think
> >> that it's supported in the stable release of our OS (Debian Etch).
> >>
> >> However... surely the logic of the situation is fairly
> >> straightforward. Ie:
> >>
> >> I am secondary at the moment. Do I have connectivity to the other
> >> node?
> >>   Yes: Do nothing, stay secondary.
> >>   No: Maybe I am going to have to take over as primary...
> >>     Do I have connectivity to the network in general?
> >>       Yes: Become primary.
> >>       No: Stay secondary.
> >>
> >> In my case there is no way that both nodes will, at the same time,
> >> have network connectivity to the ping nodes and *not* have
> >> connectivity to one another.
> >>
> >> Therefore any node which has no connectivity to the ping nodes must
> >> not become the primary.
> >>
> >> In the event of loss of connectivity to the ping nodes:
> >>
> >> If it was already the primary then it should become secondary until
> >> further notice. It is possible that the other node still has
> >> connectivity and has become primary.
> >>
> >> If it was already the secondary then it should stay in that state
> >> until further notice.
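[For reference, the constraint from the pingd "Quickstart" section named
above takes roughly the following shape inside the <constraints> section
of cib.xml. This is a sketch, not the poster's actual config: the ids and
resource name are illustrative, and the attribute name must match whatever
pingd is configured to publish (by default "pingd").]

```xml
<rsc_location id="my_resource:connected" rsc="my_resource">
  <!-- score -INFINITY: never run my_resource where this rule matches -->
  <rule id="my_resource:connected:rule" score="-INFINITY" boolean_op="or">
    <!-- the node has no pingd attribute at all... -->
    <expression id="my_resource:connected:undefined"
                attribute="pingd" operation="not_defined"/>
    <!-- ...or it reaches zero ping nodes -->
    <expression id="my_resource:connected:zero"
                attribute="pingd" operation="lte" value="0"/>
  </rule>
</rsc_location>
```

[With a rule like this, a node that loses sight of all its ping nodes is
simply ineligible to run the resource, which is the "sit down and shut up"
behaviour being asked for later in the thread.]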
> >>
> >> Is this actually more complex than I make it seem?
> >>
> >>> Best regards,
> >>>
> >>> Jonas.
> >>>
> >>> 2008/1/24, Steve Wray <[EMAIL PROTECTED]>:
> >>>> Dejan Muhamedagic wrote:
> >>>>> Hi,
> >>>>>
> >>>>> On Fri, Jan 25, 2008 at 09:03:01AM +1300, Steve Wray wrote:
> >>>>>> Forgive top posting, but I just noted this in some
> >>>>>> documentation:
> >>>>>>
> >>>>>> "Provided both HA nodes can communicate with each other, ipfail
> >>>>>> can reliably detect when one of their network links has become
> >>>>>> unusable, and compensate."
> >>>>>>
> >>>>>> In the example which I give this is not the case; the loss of
> >>>>>> connectivity is complete. The nodes cannot communicate with one
> >>>>>> another.
> >>>>> That's called split brain. Not a very nice thing for clusters.
> >>>>> Definitely to be avoided. See http://www.linux-ha.org/SplitBrain
> >>>>>
> >>>>>> One of the nodes can still contact its 'ping' node but not the
> >>>>>> other node in the cluster. It is still on the network and can
> >>>>>> still provide NFS service.
> >>>>>>
> >>>>>> The other node cannot contact its 'ping' node and also cannot
> >>>>>> contact the other node in the cluster. It is not on the network
> >>>>>> at all. It has a dead network connection.
> >>>>>>
> >>>>>> I need the node with *zero* connectivity to *not* take over as
> >>>>>> the active node, as this makes no sense at all; it's not on the
> >>>>>> network, so it is pointless bringing up NFS. It should just sit
> >>>>>> and wait for connectivity to be restored and do nothing but
> >>>>>> monitor the state of its network connection.
> >>>>> Neither node knows what is happening on the other side. So, they
> >>>>> both consider that the other node is dead (that's a two-node
> >>>>> cluster quorum policy which could be ensured to be sane if you
> >>>>> had stonith configured)
> >>>> I don't want the other node 'shot in the head' though.
> >>>> Network failure could be transitory, and when it comes back I
> >>>> don't want to have to manually restart the 'shot' server.
> >>>>
> >>>>> and try to acquire resources. ipfail
> >>>>> could ask the node to relinquish all resources in case of no
> >>>>> connectivity, but it doesn't, probably because nobody ever
> >>>>> needed such a thing.
> >>>> In this instance the two servers are on a test bed, both running
> >>>> on the same Xen host and connected via the Xen bridge.
> >>>>
> >>>> On the production setup there are two physical hosts and these
> >>>> are connected via a crossover cable. A duplicate of each of the
> >>>> test bed virtual machines runs on each of the two physical hosts.
> >>>>
> >>>> There's also no way to use a serial connection in this setup.
> >>>>
> >>>> The only possible channel of communication between the two nodes
> >>>> is a single network connection.
> >>>>
> >>>> This network connection could fail on one or both nodes.
> >>>>
> >>>> If it fails, then any node which cannot reach the network should
> >>>> just 'sit down and shut up' until the network is restored (not
> >>>> 'turn off', which is what I read as implied by stonith).
> >>>>
> >>>> Can stonith be used to induce a transient shutdown? Eg turn off
> >>>> heartbeat and wait for the network to come back, at which time
> >>>> turn heartbeat back on.
> >>>>
> >>>>> Thanks,
> >>>>>
> >>>>> Dejan
> >>>>>
> >>>>>> Steve Wray wrote:
> >>>>>>> Dejan Muhamedagic wrote:
> >>>>>>>> Hi,
> >>>>>>>>
> >>>>>>>> On Thu, Jan 24, 2008 at 09:39:05AM +1300, Steve Wray wrote:
> >>>>>>>>> Well, I posted my config and I've tried various things and
> >>>>>>>>> tested this setup... and it still behaves incorrectly:
> >>>>>>>>> going primary in the event of a complete loss of network
> >>>>>>>>> connectivity.
> >>>>>>>>>
> >>>>>>>>> I mean... it's an NFS server... a *network* filesystem.
> >>>>>>>>> If it can't connect to the network *at* *all*, it makes no
> >>>>>>>>> sense to become the primary NFS server...
> >>>>>>>>>
> >>>>>>>>> I'd really appreciate some comment on what may be wrong in
> >>>>>>>>> the config files that I've posted. If there's any further
> >>>>>>>>> info that I need to post, please mention it.
> >>>>>>>> Did you check if ipfail is running? If not, then you have to
> >>>>>>>> check the user in the respawn line. Otherwise, please post
> >>>>>>>> the logs.
> >>>>>>> Thanks for your reply!
> >>>>>>> ipfail is running, and the user in the respawn line is
> >>>>>>> correct. I just ran a test failure of the network interface on
> >>>>>>> the non-primary node. Here are the logs from this test run,
> >>>>>>> from the 'failed' node only. ipfail determines that "We are
> >>>>>>> dead" and then heartbeat decides to take over as primary.
> >>>>>>> Could this be a problem with "/etc/ha.d/rc.d/status status"?
>
> _______________________________________________
> Linux-HA mailing list
> [email protected]
> http://lists.linux-ha.org/mailman/listinfo/linux-ha
> See also: http://linux-ha.org/ReportingProblems

--
Jonás Andradas
Skype:  jontux
LinkedIn:  http://www.linkedin.com/in/andradas
GPG Fingerprint:  5A90 3319 48BC E0DC 17D9 130B B5E2 9AFD 7649 30D5
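[The ha.cf pieces this thread keeps circling back to — the ping nodes and
the respawn line that starts pingd instead of ipfail — typically look like
the following sketch. The addresses and option values here are
illustrative, not taken from the poster's configuration:]

```
# ha.cf sketch -- illustrative addresses and values only
ping 10.0.0.1 10.0.0.2
respawn root /usr/lib/heartbeat/pingd -m 100 -d 5s
# pingd multiplies the number of reachable ping nodes by -m (here 100)
# and publishes the result as the "pingd" node attribute, after a -d
# dampening delay; location constraints in the CIB can then key off it.
```

[With v2/CRM this replaces ipfail: the score, not a takeover script,
decides where resources may run.]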
