On Fri, Feb 19, 2010 at 11:31 PM, Steven Dake <sd...@redhat.com> wrote:
> On Fri, 2010-02-19 at 18:41 +0100, Andrew Beekhof wrote:
>> On Fri, Feb 19, 2010 at 5:36 PM, Dietmar Maurer <diet...@proxmox.com> wrote:
>> > Hi all, I just found a whitepaper from XenServer - it seems they implement some
>> > kind of self-fencing:
>> >
>> > -----text from XenServer High Availability Whitepaper-------
>> > The worst-case scenario for HA is the situation where a host is thought to 
>> > be off-line but is actually
>> > still writing to the shared storage, because this can result in corruption 
>> > of persistent data. To
>> > prevent this situation without requiring active power strip controls, 
>> > XenServer employs
>> > hypervisor-level fencing. This is a Xen modification which hard-powers off 
>> > the host at a very
>> > low-level if it does not hear regularly from a watchdog process running in 
>> > the control domain.
>> > Because it is implemented at a very low-level, this also protects the 
>> > storage in the case where the
>> > control domain becomes unresponsive for some reason.
>> > --------------
>> >
>> > Does that really make sense? That seems to be a very unreliable solution,
>> > because there is no guarantee that a failed node will actually 'self-fence'
>> > itself. Or am I missing something?
>>
>> Do you trust a host that has already failed in some way to now start
>> behaving correctly and fence itself?  I wouldn't.
>
> It really depends on the fencing model and what you believe to be more
> reliable.  One model says "tell node X to fence" (power fencing) while
> the alternative model says "if I don't tell you my health is good,
> please self-fence" (watchdog fencing).
>
> There are millions of lines of C code involved in directing a power
> fencing device to fence a node.  Generally in this case, the system
> directing the fencing is operating from a known good state.
>
> There are several hundred lines of C code that trigger a reboot when a
> watchdog timer isn't fed.  Generally in this case, the system directing
> the fencing (itself) has entered an undefined failure state.
>
> So a quick matrix:
> model            LOC       operating environment
> power fencing    millions  well-defined
> self fencing     hundreds  undefined
>
> Knowing well how software works, I personally would trust the code that
> is orders of magnitude smaller (hundreds of LOC vs. millions), even when operating in an
> undefined state.  The watchdog code (softdog) in the kernel is super
> simple, and relies only on timer interrupts.  It is possible the timer
> interrupts won't be delivered, in which case an NMI watchdog timer
> (which is hardware based) can be used to watch for that situation.  It
> is possible for errant kernel code to corrupt the timer list that the
> kernel uses to expire timers.  If this happens, self-fencing using
> software watchdogs will fail gloriously.
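>
> For illustration, a rough sketch of the feeder idea (hypothetical code,
> not the actual softdog or any cluster daemon; the health check is a
> placeholder for whatever "my health is good" means to the cluster):
>
> /* Illustrative only: feed /dev/watchdog while a (hypothetical) health
>  * check passes.  If the check fails, stop feeding and let the watchdog
>  * reboot the node; that is the "self-fence". */
> #include <fcntl.h>
> #include <unistd.h>
>
> /* Placeholder: membership, quorum, storage access, ... */
> static int node_is_healthy(void) { return 1; }
>
> int main(void)
> {
>     int fd = open("/dev/watchdog", O_WRONLY);
>     if (fd < 0)
>         return 1;                /* no watchdog device available */
>
>     while (node_is_healthy()) {
>         write(fd, "\0", 1);      /* keepalive: resets the watchdog timer */
>         sleep(5);                /* must stay well under the timeout */
>     }
>
>     /* Unhealthy: fall out of the loop WITHOUT writing the magic close
>      * character ('V'), so the timer keeps running and fences this node. */
>     for (;;)
>         pause();
> }
>
> If the node hangs hard enough that this loop never runs again, the effect
> is the same: the timer expires and the box resets.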
>
> When considering hardware watchdog timer devices, the decision becomes
> even more clear, since a hardware watchdog timer has almost complete
> isolation from the system in which it is integrated.  Also it is
> designed and hardened around one purpose - to powercycle a system if it
> is not fed a healthcheck.
>
> Expanding the matrix:
> model             LOC       operating environment
> power fencing     millions  well-defined
> software watchdog hundreds  undefined
> hardware watchdog ASIC      well-defined
>
> In the case of a hardware watchdog, the LOC is hidden behind a self
> contained ASIC.  This ASIC could be defective in some way.  But it is
> also isolated from the remaining system so that it operates in a
> well-defined environment.
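>
> The same /dev/watchdog interface drives a hardware watchdog; only the
> driver behind it changes.  A sketch of how the device might be queried
> and configured, using the standard Linux watchdog ioctls (the timeout
> value here is purely illustrative):
>
> #include <fcntl.h>
> #include <stdio.h>
> #include <sys/ioctl.h>
> #include <unistd.h>
> #include <linux/watchdog.h>
>
> int main(void)
> {
>     int fd = open("/dev/watchdog", O_WRONLY);
>     if (fd < 0)
>         return 1;
>
>     struct watchdog_info info;
>     if (ioctl(fd, WDIOC_GETSUPPORT, &info) == 0)
>         printf("driver: %s\n", (char *)info.identity); /* chipset or BMC watchdog */
>
>     int timeout = 30;                        /* seconds; illustrative */
>     ioctl(fd, WDIOC_SETTIMEOUT, &timeout);   /* driver may round the value */
>     ioctl(fd, WDIOC_KEEPALIVE, 0);           /* feed it once */
>
>     write(fd, "V", 1);                       /* magic close: disarm on exit */
>     close(fd);
>     return 0;
> }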
>
> Compare those with the failure scenarios of power fencing:
> 1) the power fencing device could have failed in some way
> 2) the power fencing device could process a request incorrectly
> 3) the code that interfaces with the power fencing device could be
> defective in some conditions
> 4) the power fencing hardware could fail to reset its relays for the
> node to be rebooted
> 5) the fencing system directing the fencing could fail in its
> communication to the fencing device
> 6) the network switch connecting the fencing device to the host systems
> could have a transient failure to the particular port on which the power
> fencing device is configured
> ... think up your own ...
>
> There are thousands of interactions with power fencing and every one of
> them needs to work perfectly for power fencing to work.

That's not the problem.
It's the false positives you need to worry about (devices that report
success when power fencing failed).

When power fencing fails healthy nodes get some sort of indication and
can take appropriate action.
If suicide fails, um...

I'd rather take my (incredibly slim) chances of a false positive than
trust something where
a) there is no-one to confirm it was successful
b) no matter how you slice it, you're trusting a defective node to
function correctly - which is an oxymoron

The whole goal of fencing is "to be sure"... "Could the node be doing
something bad? I don't know, let's be _sure_."
Suicide simply does not give you that because there is no confirmation.
The healthy nodes can only assume it worked and continue.  They hope,
but they cannot be sure. Particularly if there's a network outage.

I'll take "I turned its power off" any day of the week ;-)

> On the plus
> side, the system is operating in a known good state rather than an
> undefined failure condition.
>
> Neither system is perfect, and it is likely a matter of opinion which
> you choose.

That much is certain; at least we can all agree that emacs is the
better editor ;-)

> ATM there are no good watchdog-based cluster fencing
> implementations available in the community, but it is something I'd like
> to tackle.
>
_______________________________________________
Openais mailing list
Openais@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/openais
