Re: [Linux-cluster] Fenced failing continuously

Ian Hayes Mon, 13 Apr 2009 09:19:37 -0700

I realize that the ssh option is not optimal, but I'm stuck with the design
requirements. I'm hoping I can get them changed.


But, this got me thinking... conventional fencing is not failsafe. I can
think of quite a number of less than optimal but entirely real-world
situations where a node can die and not be able to be absolutely fenced off.
iLO only works if the victim node still has power. I've only been in 1 shop
that had the APC managed power, and they didn't even have that set up.
Brocade fencing doesn't always apply, especially if you're just doing a
virtual IP. So having a second fencing method as a backup is not always
feasible.

So even with more traditional fences, this may not work unless I start
modding fence scripts to return a success code even if they fail.
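
To make that concrete, here is a rough sketch of the kind of wrapper I
have in mind. It is only an illustration, not a tested agent. It assumes
fenced hands the agent its options as name=value lines on stdin and treats
exit code 0 as "fenced", which is how I understand the stock agents to
behave. The function names and the retries/force_success parameters are
made up for this example. The wrapper tries each real agent in order,
repeats a limited number of times, and only reports success on total
failure if force_success is set:

```python
import subprocess
import time


def parse_options(lines):
    """Parse name=value lines (the form agents are assumed to receive on stdin)."""
    opts = {}
    for line in lines:
        line = line.strip()
        # Skip blanks, comments, and anything that is not name=value.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        opts[key.strip()] = value.strip()
    return opts


def run_agent(cmd, stdin_text):
    """Run one real fence agent; return True only if it exits with 0."""
    try:
        proc = subprocess.run(cmd, input=stdin_text, text=True, timeout=30)
    except (OSError, subprocess.TimeoutExpired):
        return False
    return proc.returncode == 0


def fence(agents, stdin_text, retries=3, delay=5.0, force_success=False,
          runner=run_agent, sleep=time.sleep):
    """Try each agent in order, up to `retries` rounds.

    Returns 0 as soon as any agent succeeds. If every round fails, returns 1,
    unless force_success is set (hypothetical option), in which case it
    returns 0 even though nothing was actually fenced.
    """
    for attempt in range(retries):
        for cmd in agents:
            if runner(cmd, stdin_text):
                return 0
        if attempt < retries - 1:
            sleep(delay)
    return 0 if force_success else 1
```

A real agent script would read stdin, call something like
fence([["fence_iptables"]], raw_stdin, force_success=True) and pass the
result to sys.exit(). With force_success set, the surviving node would
take over after the retries run out, but nothing guarantees the dead node
is really gone, so this only seems defensible for something like a plain
virtual IP with no shared storage behind it.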

On Fri, Apr 10, 2009 at 2:36 AM, Virginian <[email protected]> wrote:

>  Hi Ian,
>
> I think there is a flaw in the design. For example, say the network card
> fails on machine A. Machine B detects this and tries to fence machine A. The
> problem with doing it via ssh to modify iptables is that there is no network
> connectivity to Machine A and hence this mechanism will never work. What you
> need is a solution that works independently of the OS such as a power switch
> or remote management interface such as IBM RSA II, HP iLO etc. With fencing,
> the solution has to be absolute and ruthless in that, in this example,
> machine B needs to be able to fence Machine A absolutely every time there is
> a problem and as soon as there is a problem.
>
> Regards
>
> John
>
>
>
> ----- Original Message -----
> *From:* Ian Hayes <[email protected]>
> *To:* [email protected]
> *Sent:* Friday, April 10, 2009 1:07 AM
> *Subject:* [Linux-cluster] Fenced failing continuously
>
> I've been testing a newly built 2-node cluster. The cluster resources are a
> virtual IP and squid, so in a node failure, the VIP would go to the
> surviving node and start up Squid. I'm running a modified fencing agent that
> will SSH into the failing node and firewall it off via iptables (not my
> choice).
>
> This all works fine for graceful shutdowns, but when I do something nasty
> like pulling the power cord on the node that is currently running the
> service, the surviving node never assumes the service and spends all its
> time trying to fire off the fence agent, which obviously will not work
> because the server is completely offline. The only way I can get the
> surviving node to assume the VIP and start Squid is to fence_ack_manual,
> which sort of runs counter to running a cluster to begin with. The logs are
> filled with
>
> Apr 12 00:01:44 <hostname> fenced[3223]: fencing node "<otherhost>"
> fenced[3223]: agent "fence_iptables" reports: ssh: connect to host
> xx.xx.xx.xx port 22: No route to host
> Could not disable xx.xx.xx.xx on
>
> Is this a misconfiguration, or is there an option I can include somewhere
> to tell the nodes to give it up after a certain number of tries?
>
> ------------------------------
>
> --
> Linux-cluster mailing list
> [email protected]
> https://www.redhat.com/mailman/listinfo/linux-cluster
>

