I realize that the ssh option is not optimal, but I'm stuck with the design requirements. I'm hoping I can get them changed.
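For reference, here is a minimal sketch of what an ssh-based agent like this boils down to. It is not the actual fence_iptables script: the function names, the timeout values, and the remote command passed in are all assumptions made up for illustration. The point is only that when the victim is unreachable, ssh fails, the agent reports failure, and (judging by the logs quoted below) fenced simply tries again.

```python
import subprocess


def build_ssh_command(host, remote_cmd, connect_timeout=5):
    """Build the ssh invocation; remote_cmd is whatever firewalls the node off."""
    return [
        "ssh",
        "-o", "BatchMode=yes",                        # never prompt for a password
        "-o", f"ConnectTimeout={connect_timeout}",    # give up quickly on a dead host
        f"root@{host}",
        remote_cmd,
    ]


def fence_via_ssh(host, remote_cmd, run=subprocess.run, connect_timeout=5):
    """Return 0 if the remote command ran successfully, 1 otherwise.

    `run` is injectable so the logic can be exercised without a real host.
    """
    cmd = build_ssh_command(host, remote_cmd, connect_timeout)
    try:
        # Hard upper bound on the whole call, in case the session hangs.
        result = run(cmd, timeout=connect_timeout + 5)
    except (subprocess.TimeoutExpired, OSError):
        return 1
    # ssh exits 255 on connection errors such as "No route to host".
    return 0 if result.returncode == 0 else 1
```

With the power cord pulled, `fence_via_ssh` can only ever return 1, so an agent built this way never reports a successful fence for a node that is completely offline.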
But this got me thinking... conventional fencing is not failsafe either. I can think of quite a number of less-than-optimal but entirely real-world situations where a node can die and cannot be absolutely fenced off. iLO only works if the victim node still has power. I've only been in one shop that had APC managed power, and they didn't even have it set up. Brocade fencing doesn't always apply, especially if you're just doing a virtual IP. So having a second fencing method as a backup is not always feasible. Even with more traditional fence agents, then, failover may not work unless I start modding fence scripts to return a success code even when they fail.

On Fri, Apr 10, 2009 at 2:36 AM, Virginian <[email protected]> wrote:

> Hi Ian,
>
> I think there is a flaw in the design. For example, say the network card
> fails on machine A. Machine B detects this and tries to fence machine A. The
> problem with doing it via ssh to modify iptables is that there is no network
> connectivity to Machine A and hence this mechanism will never work. What you
> need is a solution that works independently of the OS such as a power switch
> or remote management interface such as IBM RSA II, HP iLO etc. With fencing,
> the solution has to be absolute and ruthless in that, in this example,
> machine B needs to be able to fence Machine A absolutely every time there is
> a problem and as soon as there is a problem.
>
> Regards
>
> John
>
> ----- Original Message -----
> *From:* Ian Hayes <[email protected]>
> *To:* [email protected]
> *Sent:* Friday, April 10, 2009 1:07 AM
> *Subject:* [Linux-cluster] Fenced failing continuously
>
> I've been testing a newly built 2-node cluster. The cluster resources are a
> virtual IP and squid, so in a node failure, the VIP would go to the
> surviving node and start up Squid. I'm running a modified fencing agent that
> will SSH into the failing node and firewall it off via IPtables (not my
> choice).
>
> This all works fine for graceful shutdowns, but when I do something nasty
> like pulling the power cord on the node that is currently running the
> service, the surviving node never assumes the service and spends all its
> time trying to fire off the fence agent, which obviously will not work
> because the server is completely offline. The only way I can get the
> surviving node to assume the VIP and start Squid is to fence_ack_manual,
> which sort of runs counter to running a cluster to begin with. The logs are
> filled with
>
> Apr 12 00:01:44 <hostname> fenced[3223]: fencing node "<otherhost>"
> Could not disable xx.xx.xx.xx on 23]: agent "fence_iptables" reports:
> ssh: connect to host xx.xx.xx.xx port 22: No route to host
>
> Is this a misconfiguration, or is there an option I can include somewhere
> to tell the nodes to give it up after a certain number of tries?
>
> ------------------------------
>
> --
> Linux-cluster mailing list
> [email protected]
> https://www.redhat.com/mailman/listinfo/linux-cluster
--
Linux-cluster mailing list
[email protected]
https://www.redhat.com/mailman/listinfo/linux-cluster
