Re: [Linux-HA] Recovering from "unexpected bad things" - is STONITH the answer?

Alan Robertson Tue, 06 Nov 2007 12:46:38 -0800

Andrew Beekhof wrote:

On Nov 6, 2007, at 6:25 PM, Alan Robertson wrote:
We now have the ComponentFail test in CTS. Thanks Lars for getting it going!
And, in the process, it's showing up some kinds of problems that we hadn't been looking for before. A couple of examples of such problems can be found here:
http://old.linux-foundation.org/developer_bugzilla/show_bug.cgi?id=1762
It is very rare for a stonith action to be actually initiated in this case.
But having stonith disabled results in very dangerous yet unavoidable assumptions being made.
Which is why stonith is so highly encouraged.
http://old.linux-foundation.org/developer_bugzilla/show_bug.cgi?id=1732

The question that comes up is this:
For problems that should "never" happen, like the death of one of our core/key processes, is an immediate reboot of the machine the right recovery technique?
The advantages of such a choice include:
It is fast
It will invoke recovery paths that we exercise a lot in testing
It is MUCH simpler than trying to recover from all these cases,
    therefore almost certainly more reliable

The disadvantages of such a choice include:
It is crude, and very annoying
It probably shouldn't be invoked for single-node clusters (?)
It could be criticized as being lazy
It shouldn't be invoked if there is another simple and correct method

Continual rebooting becomes a possibility...
Assuming continual re-failure of one of our processes, yes.
We do not have a policy of doing this throughout the project; what we have is a few places where we do it.
I propose that we should consider making a uniform policy decision for the project - and specifically decide to use ungraceful reboots as our recovery method for "key" processes dying (for example: CCM, heartbeat, CIB, CRM). It should work for those cases where people don't configure in watchdogs or explicitly define any STONITH devices, and also independently of quorum policies - because AFAIK it seems like the right choice, and there's no technical reason not to do so. My inclination is to think that this is a good approach to take for problems that in our best-guess judgment "shouldn't happen".
I dislike it for the reason that node suicide provides a false sense of security. You end up making the window of opportunity for "something bad" to happen smaller, but it still exists.

If you have STONITH configured, the two methods are equally safe. If you don't have STONITH configured, then my suggested approach is significantly superior. The window for damage is very small - heartbeat is a realtime process, and it is also the same process that is sending out the "death of child" notices. Suitable adjustment of event priorities could eliminate the window of possibility in the "don't have stonith configured" case.
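
A minimal sketch of the uniform policy under discussion, written in Python for brevity rather than in heartbeat's own code. The names Recovery, KEY_PROCESSES and recovery_for are hypothetical, and the single-node exemption is only the open question raised in the list of disadvantages, not a settled rule. The function only decides what to do; it does not reboot anything.

```python
from enum import Enum


class Recovery(Enum):
    RESTART_CHILD = "restart the child process"
    REBOOT_NODE = "ungraceful reboot of the node"


# The "key" processes named in the proposal (CCM, heartbeat, CIB, CRM).
KEY_PROCESSES = frozenset({"ccm", "heartbeat", "cib", "crm"})


def recovery_for(dead_process: str, single_node: bool = False) -> Recovery:
    """Pick a recovery action for a "death of child" notice.

    Assumption: a single-node cluster is exempt from the reboot policy,
    since there is no peer to take over the resources.
    """
    if dead_process.lower() in KEY_PROCESSES and not single_node:
        return Recovery.REBOOT_NODE
    return Recovery.RESTART_CHILD
```

Because the decision depends only on the name of the dead child, it can be made by the same process that receives the notice, independently of STONITH configuration or quorum policy.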

I certainly wouldn't ever stop encouraging people to configure and use STONITH.

There are numerous good reasons not to use ssh stonith in production. It is not reliable, only works in a development environment, and IMHO can't be made reliable (I spent some time trying when I wrote it). It also relies on having ssh and "at" installed, ssh ports open inbound and outbound, and "at" running. It's just too fragile.

In fact, it's almost impossible to write a stonith of this form and have it both work reliably and report on its success reliably. After all, if it waits until it succeeds to report success, then it's not there to do the reporting. This is why the current code uses "at".
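
The deferred-action pattern described above can be sketched as follows. This is an illustration only, not the actual ssh stonith plugin: the function name, the reset command and the delay are all assumptions. It builds the command line without running it.

```python
import shlex


def build_deferred_reset_command(host: str,
                                 reset_cmd: str = "/sbin/reboot -nf",
                                 delay_minutes: int = 1) -> list[str]:
    """Build an ssh command that queues a reset on `host` through at(1).

    The remote shell only queues the job, so ssh returns before the node
    goes down. The exit status therefore means "reset was queued", not
    "reset happened".
    """
    unit = "minute" if delay_minutes == 1 else "minutes"
    remote = f"echo {shlex.quote(reset_cmd)} | at now + {delay_minutes} {unit}"
    return ["ssh", host, remote]
```

The gap between "queued" and "happened" is exactly why success cannot be reported reliably: the caller learns nothing if the at daemon is not running or the queued job fails.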


I don't believe that the ssh stonith approach is going to work.

In addition, your suggestion suffers from the "top of the stack" reliability problem I mentioned in my previous email. The lower in the stack that this happens, the fewer components are involved, and the more reliable the result. The higher in the stack you try to make this happen, the more things have to be working, and the less reliable the result.

Both your approach and mine are reasonably fail-fast. As a failure recovery mechanism, however, recovering reliably is more important than exactly how fast the code fails in these error cases. The fewer things that have to work, the more reliable it is. Given how many components have to work for the failure to be detected, reported, a decision made, and actions queued up and carried out, the recovery failure probabilities differ by several orders of magnitude.
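
The "fewer things that have to work" argument can be made concrete with a small calculation. This is a sketch under a strong assumption: every component in the recovery path works independently with the same probability. The figures used are hypothetical and are not measurements of heartbeat or the CRM.

```python
def chain_failure_probability(per_component_reliability: float,
                              components: int) -> float:
    """Probability that a recovery path fails when every one of
    `components` independent steps must work for it to succeed."""
    return 1.0 - per_component_reliability ** components


# Hypothetical figures: one low-level step versus a ten-step path
# (detect, report, decide, queue, carry out, ...).
low_in_stack = chain_failure_probability(0.999, 1)
high_in_stack = chain_failure_probability(0.999, 10)
```

Under these assumptions the longer path always fails more often than the single step. How large the gap is depends on the real per-component reliabilities, which this sketch does not attempt to estimate.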

To put this in perspective, what we're arguing over is how to implement method (a) from my previous reply to Kevin Tomlinson.


So, I don't hear you arguing for a general approach of (b), (c), or (d).

--
    Alan Robertson <[EMAIL PROTECTED]>

"Openness is the foundation and preservative of friendship... Let me claim from you at all times your undisguised opinions." - William Wilberforce

_______________________________________________
Linux-HA mailing list
[email protected]
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems
