Re: [Linux-HA] Re: [Linux-ha-dev] If suicide is the answer, you're asking the wrong question

Alan Robertson Thu, 08 Nov 2007 08:07:20 -0800

Lars Marowsky-Bree wrote:

On 2007-11-08T16:04:25, Andrew Beekhof <[EMAIL PROTECTED]> wrote:

The attached table^ attempts to explain why node suicide, at least the "so simple it can't possibly have a single bug" kind being proposed, is no substitute for enabling stonith (even when no plugins are configured!).


Agreed. Though under certain circumstances, "suicide" as a STONITH
plugin can be made to work reliably "enough".

From the discussion, it seems clear to me that many people are under the impression that the proposal would protect their data in other (and far more likely) failure situations.
This is not the case.


That is correct, but a fail-fast mechanism (suicide, reboot on certain
errors) is less likely to introduce errors than a process recovery
attempt. That's well documented in the literature.

_In combination_ with STONITH this can ensure that, for example, an
internal CRM failure keeps the node active at the heartbeat/ccm level
and thus fencing does not occur right away (it will eventually occur
once a "stop" times out, but that is not quite the same).

Fail-fast error escalation does not replace STONITH.

There is also the issue that when stonith is enabled, node suicide guarantees the worst case (avoidable service outage) will always occur.
I object quite strongly to that.


Hm? That is not true. Node suicide does not automatically cause service
outage. The assumption is that node suicide only occurs under conditions
so severe that fencing would occur eventually anyway.

Fail-fast is about bubbling errors up faster, so that STONITH can occur
more reliably (and possibly faster) - it escalates partial, localized
but not-yet-recoverable errors to full node failures, which the cluster
can then recover (using STONITH).

If the node suicide is combined with the unsolicited "I will have
committed suicide within 5s" message, the remaining cluster very likely
can skip an additional power cycle of the node.
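
[Editorial illustration, not part of the original message.] The idea in the
paragraph above can be sketched roughly as follows. This is only a minimal
model under stated assumptions: the class and method names (PeerView,
record_notice, recovery_action) are hypothetical and do not correspond to any
Heartbeat or CRM interface, and treating an expired notice as sufficient to
skip the power cycle is an assumption (the message itself only says the
cluster "very likely" can skip it).

```python
import time
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class PeerView:
    """Hypothetical model of what surviving nodes know about suicide notices."""

    # node name -> monotonic deadline by which that node promised to be down
    notices: Dict[str, float] = field(default_factory=dict)

    def record_notice(self, node: str, within_s: float,
                      now: Optional[float] = None) -> None:
        """Store an unsolicited "I will be down within N seconds" notice."""
        now = time.monotonic() if now is None else now
        self.notices[node] = now + within_s

    def recovery_action(self, node: str, now: Optional[float] = None) -> str:
        """Decide how to recover a node that has dropped out of membership."""
        now = time.monotonic() if now is None else now
        deadline = self.notices.get(node)
        if deadline is None:
            # No notice received: the node's state is unknown, so fence it.
            return "stonith"
        if now < deadline:
            # The node may still be running; do not start recovery yet.
            return "wait"
        # Assumption: the promised deadline has passed, so the extra
        # power cycle can be skipped.
        return "skip-power-cycle"
```

A node that vanishes without having sent a notice still falls through to
STONITH in this sketch, which matches the point that fail-fast escalation
complements fencing rather than replacing it.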

So to summarize: if you care about your data, enable stonith.


That summary is entirely correct, and I agree that people should not
consider "node suicide" a panacea or a replacement for STONITH.

But that does not mean that fail-fast escalation is not still a good
idea ;-)

Perhaps the question should instead be: why haven't we already made stonith mandatory?


That's a good question indeed. Possibly because of non-shared storage
environments, but in reality, those are the minority. We probably ought
to change the default.


This is a reply to Lars' reply...
Lars said almost exactly what I would have said.  Thanks Lars!


--
    Alan Robertson <[EMAIL PROTECTED]>

"Openness is the foundation and preservative of friendship... Let me claim from you at all times your undisguised opinions." - William Wilberforce

_______________________________________________
Linux-HA mailing list
[email protected]
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems
