Re: [Linux-HA] Re: [Linux-ha-dev] If suicide is the answer, you're asking the wrong question

Andrew Beekhof Thu, 08 Nov 2007 08:37:49 -0800


On Nov 8, 2007, at 4:18 PM, Lars Marowsky-Bree wrote:

On 2007-11-08T16:04:25, Andrew Beekhof <[EMAIL PROTECTED]> wrote:
The attached table^ attempts to explain why node suicide, at least the "so simple it can't possibly have a single bug" kind being proposed, is no substitute for enabling stonith (even when no plugins are configured!).
Agreed. Though under certain circumstances, "suicide" as a STONITH
plugin can be made to work reliably "enough".


sure

From the discussion, it seems clear to me that many people are under the impression that the proposal would protect their data in other (and far
more likely) failure situations.
This is not the case.


That is correct, but a fail-fast mechanism (suicide, reboot on certain
errors) is less likely to introduce errors than a process recovery
attempt. That's well documented in the literature.

_In combination_ with STONITH this can ensure that, for example, an
internal CRM failure keeps the node active at the heartbeat/ccm level
and thus fencing does not occur right away (it'll eventually occur due
to eventually "stop" timing out, but that is not quite the same).

Fail-fast error escalation does not replace STONITH.

There is also the issue that when stonith is enabled, node suicide
guarantees the worst case (avoidable service outage) will always occur.
I object quite strongly to that.

Hm? That is not true. Node suicide does not automatically cause service
outage.


Yes it does - that was the initial point, to address 1762.

The assumption is that node suicide only occurs under conditions
so severe where fencing would occur eventually anyway.

My understanding of the proposal was that "any child process exiting, ever" is counted as "so severe".

At least it would have to be that way to make it relevant to bug 1762.

So with suicide enabled, the node is always rebooted, stopping all services it has running. With only stonith enabled, the cluster will normally recover fast enough to avoid fencing.

Therefore there are times when the node would commit suicide when stonith would not have been needed.

That qualifies as avoidable service outage.

Rebooting because processes are respawning themselves several times a second or returning 100 (as if to say, I can't continue)... yeah, that I see adding value over stonith on its own.

But not for any process, every time, while stonith is enabled.
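
To make that distinction concrete, here is a rough sketch of the kind of policy I mean. It is not heartbeat code: the class name, the threshold of 3 and the 1 second window are assumptions picked for illustration (standing in for "several times a second"), and "reboot" simply stands for whatever fail-fast action gets chosen. Only the exit code 100 comes from the paragraph above.

```python
from collections import deque

FATAL_EXIT_CODE = 100   # "returning 100", i.e. the child says it cannot continue
MAX_RESPAWNS = 3        # assumed stand-in for "several times"
WINDOW_SECONDS = 1.0    # "... a second"


class RespawnPolicy:
    """Hypothetical helper: decide whether a child exit is respawned or escalated."""

    def __init__(self, max_respawns=MAX_RESPAWNS, window=WINDOW_SECONDS):
        self.max_respawns = max_respawns
        self.window = window
        self.exits = {}  # child name -> deque of recent exit timestamps

    def on_child_exit(self, name, exit_code, now):
        """Return "reboot" or "respawn" for one child exit seen at time `now`."""
        if exit_code == FATAL_EXIT_CODE:
            return "reboot"
        recent = self.exits.setdefault(name, deque())
        recent.append(now)
        # Forget exits that have fallen out of the sliding window.
        while recent and now - recent[0] > self.window:
            recent.popleft()
        if len(recent) > self.max_respawns:
            return "reboot"
        return "respawn"
```

With these assumed values, a single exit of a child is simply respawned and left for the cluster to recover, while four exits within one second, or one exit with code 100, escalate to a reboot.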

Fail-fast is about bubbling errors up faster, so that STONITH can occur
more reliably (and possibly faster) - it escalates partial, localized
but not-yet-recoverable errors


Assuming that a process exiting is a "not-yet-recoverable" error.

However, for the most likely cases, at least in the crm, this is not true.

to full node failures, which the cluster
can then recover (using STONITH).

If the node suicide is combined with the unsolicited "I will have
committed suicide within 5s" message, the remaining cluster very likely
can skip an additional power cycle of the node.

So to summarize: if you care about your data, enable stonith.


That summary is entirely correct, and I agree that people should not
consider "node suicide" a panacea and as a replacement for STONITH.

But that does not mean that fail-fast escalation is not still a good
idea ;-)


True.

But I don't like this particular proposal, as it has so far been explained, nor do I see the point of it when split-brain is a hole wide enough to drive a truck through and far more likely to occur.

Perhaps the question should instead be: why haven't we already made stonith
mandatory?
That's a good question indeed. Possibly because of non-shared storage
environments, but in reality, those are the minority. We probably ought
to change the default.


_______________________________________________
Linux-HA mailing list
[email protected]
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems
