Hello,

I think that if you switched to version 2 with the CRM and used pingd instead of ipfail, your problem could be solved: pingd assigns a score depending on how many "ping nodes" are reachable. You could then set constraints such that if the pingd score is 0 (or below a minimum number of reachable ping nodes), the resources are *not* allowed to run.
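As a rough sketch of what that could look like (assuming Heartbeat 2.x with `crm yes` in ha.cf; the resource name `nfs_group` and the XML ids here are illustrative, not from your config): run pingd from ha.cf so it maintains a node attribute, then add a location constraint to the CIB that forbids the resource wherever that attribute is undefined or zero:

```
# /etc/ha.d/ha.cf -- run pingd as a respawned client; it publishes a node
# attribute named "pingd" scored as (number of reachable ping nodes) * 100
respawn hacluster /usr/lib/heartbeat/pingd -m 100 -d 5s -a pingd
```

```
<!-- constraint fragment, loadable e.g. with:
     cibadmin -C -o constraints -x pingd-constraint.xml -->
<rsc_location id="nfs_connected" rsc="nfs_group">
  <rule id="nfs_connected_rule" score="-INFINITY" boolean_op="or">
    <!-- ban the resource where pingd is unset or no ping node is reachable -->
    <expression id="pingd_undefined" attribute="pingd" operation="not_defined"/>
    <expression id="pingd_zero" attribute="pingd" operation="lte" value="0" type="number"/>
  </rule>
</rsc_location>
```

With the -INFINITY score, a node that loses contact with all of its ping nodes simply becomes ineligible to run the resource; when connectivity returns, the attribute goes back up and the node becomes eligible again, with no manual restart needed.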
Best regards,
Jonas.

2008/1/24, Steve Wray <[EMAIL PROTECTED]>:
> Dejan Muhamedagic wrote:
> > Hi,
> >
> > On Fri, Jan 25, 2008 at 09:03:01AM +1300, Steve Wray wrote:
> >> Forgive top posting but I just noted this in some documentation:
> >>
> >> "Provided both HA nodes can communicate with each other, ipfail can
> >> reliably detect when one of their network links has become unusable,
> >> and compensate."
> >>
> >> In the example which I give this is not the case; the loss of
> >> connectivity is complete. The nodes cannot communicate with one
> >> another.
> >
> > That's called split brain. Not a very nice thing for clusters.
> > Definitely to be avoided. See http://www.linux-ha.org/SplitBrain
> >
> >> One of the nodes can still contact its 'ping' node but not the other
> >> node in the cluster. It is still on the network and can still provide
> >> NFS service.
> >>
> >> The other node cannot contact its 'ping' node and also cannot contact
> >> the other node in the cluster. It is not on the network at all. It
> >> has a dead network connection.
> >>
> >> I need the node with *zero* connectivity to *not* take over as the
> >> active node, as this makes no sense at all; it's not on the network,
> >> so it is pointless bringing up NFS. It should just sit and wait for
> >> connectivity to be restored and do nothing but monitor the state of
> >> its network connection.
> >
> > Neither node knows what is happening on the other side. So, they
> > both consider that the other node is dead (that's a two-node
> > cluster quorum policy which could be ensured to be sane if you
> > had stonith configured)
>
> I don't want the other node 'shot in the head', though. The network
> failure could be transitory, and when it comes back I don't want to
> have to manually restart the 'shot' server.
>
> > and try to acquire resources. ipfail
> > could ask the node to relinquish all resources in case of no
> > connectivity, but it doesn't, probably because nobody ever needed
> > such a thing.
> In this instance the two servers are on a test bed, both running on the
> same Xen host and connected via the Xen bridge.
>
> On the production setup there are two physical hosts, and these are
> connected via a crossover cable. A duplicate of the test bed virtual
> machines runs on each of the two physical hosts.
>
> There's also no way to use a serial connection in this setup. The only
> possible channel of communication between the two nodes is via a
> single network connection.
>
> This network connection could fail on one or both nodes.
>
> If it fails, then any node which cannot reach the network should just
> 'sit down and shut up' until the network is restored (not 'turn off',
> which is what I read as implied by stonith).
>
> Can stonith be used to induce a transient shutdown? E.g. turn off
> heartbeat and wait for the network to come back, at which time turn
> heartbeat back on.
>
> > Thanks,
> >
> > Dejan
> >
> >> Steve Wray wrote:
> >>> Dejan Muhamedagic wrote:
> >>>> Hi,
> >>>>
> >>>> On Thu, Jan 24, 2008 at 09:39:05AM +1300, Steve Wray wrote:
> >>>>> Well I posted my config and I've tried various things and tested
> >>>>> this setup... and it still behaves incorrectly: going primary in
> >>>>> the event of a complete loss of network connectivity.
> >>>>>
> >>>>> I mean... it's an NFS server... a *network* filesystem. If it
> >>>>> can't connect to the network *at* *all* it makes no sense to
> >>>>> become the primary NFS server...
> >>>>>
> >>>>> I'd really appreciate some comment on what may be wrong in the
> >>>>> config files that I've posted. If there's any further info that I
> >>>>> need to post, please mention it.
> >>>> Did you check if ipfail is running? If not, then you have to
> >>>> check the user in the respawn line. Otherwise, please post the
> >>>> logs.
> >>> Thanks for your reply!
> >>> ipfail is running, and the user in the respawn line is correct.
> >>> I just ran a test failure of the network interface on the
> >>> non-primary node.
> >>> Here are the logs from this test run, only from the 'failed' node.
> >>> ipfail determines that "We are dead" and then heartbeat decides to
> >>> take over as primary.
> >>> Could this be a problem with "/etc/ha.d/rc.d/status status"?

_______________________________________________
Linux-HA mailing list
[email protected]
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems
