On Nov 6, 2007, at 9:46 PM, Alan Robertson wrote:
Andrew Beekhof wrote:
On Nov 6, 2007, at 6:25 PM, Alan Robertson wrote:
We now have the ComponentFail test in CTS. Thanks Lars for
getting it going!
And, in the process, it's showing up some kinds of problems that
we hadn't been looking for before. A couple of examples of such
problems can be found here:
http://old.linux-foundation.org/developer_bugzilla/show_bug.cgi?id=1762
It is very rare for a stonith action to be actually initiated in
this case.
But having stonith disabled results in very dangerous yet
unavoidable assumptions being made.
Which is why stonith is so highly encouraged.
http://old.linux-foundation.org/developer_bugzilla/show_bug.cgi?id=1732
The question that comes up is this:
For problems that should "never" happen like death of one of our
core/key processes, is an immediate reboot of the machine the
right recovery technique?
The advantages of such a choice include:
  - It is fast
  - It will invoke recovery paths that we exercise a lot in testing
  - It is MUCH simpler than trying to recover from all these cases,
    therefore almost certainly more reliable
The disadvantages of such a choice include:
  - It is crude, and very annoying
  - It probably shouldn't be invoked for single-node clusters (?)
  - It could be criticized as being lazy
  - It shouldn't be invoked if there is another simple and correct
    method
  - Continual rebooting becomes a possibility...
Assuming continual re-failure of one of our processes, yes.
We do not have a policy of doing this throughout the project; what
we have is a few places where we do it.
I propose that we should consider making a uniform policy decision
for the project - and specifically decide to use ungraceful
reboots as our recovery method for "key" processes dying (for
example: CCM, heartbeat, CIB, CRM). It should work for those
cases where people don't configure in watchdogs or explicitly
define any STONITH devices, and also independently of quorum
policies - because AFAIK it seems like the right choice, there's
no technical reason not to do so.
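The proposed policy can be sketched as follows. This is an illustrative reconstruction, not heartbeat code: the Supervisor class and the process names are hypothetical (the thread names CCM, heartbeat, CIB and CRM as the key processes), and the demonstration stubs out the reboot so the sketch is safe to run.

```python
import os

# Hypothetical set of key process names, for illustration only.
CRITICAL = {"ccm", "cib", "crmd"}

class Supervisor:
    """Illustrative sketch of the "reboot on key-child death" policy."""

    def __init__(self, reboot=lambda: os.system("/sbin/reboot -f")):
        self._reboot = reboot      # ungraceful reboot; stubbed below
        self._children = {}        # pid -> process name

    def register(self, pid, name):
        self._children[pid] = name

    def on_child_exit(self, pid):
        # Called from the SIGCHLD path: if a key process died, fail
        # fast with an immediate reboot instead of attempting
        # piecemeal recovery.
        name = self._children.pop(pid, None)
        if name in CRITICAL:
            self._reboot()

# Demonstration with the reboot stubbed out:
fired = []
sup = Supervisor(reboot=lambda: fired.append(True))
sup.register(4242, "cib")
sup.on_child_exit(4242)   # death of a key child triggers the stub
```

Because the recovery action is a single local call made by the same process that sees the child die, very little has to be working for it to succeed.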
My inclination is to think that this is a good approach to take
for problems that in our best-guess judgment "shouldn't happen".
I dislike it because node suicide provides a false sense of
security.
You end up making the window of opportunity for "something bad" to
happen smaller, but it still exists.
If you have STONITH configured, the two methods are equally safe.
I can't agree; STONITH is a clear winner.
You get more options in cases like failed stops, and a 100%
guarantee that resources will never be active on more than one node.
If you don't have STONITH configured, then my suggested approach is
significantly superior.
Being superior to nothing isn't a tough ask.
The window for damage is very small - heartbeat is a realtime
process, and it is also the same process that is sending out the
"death of child" notices. Suitable adjustment of event priorities
could eliminate the window of possibility in the "don't have
STONITH configured" case.
You can't eliminate it.
The death of the machine is not instantaneous and in that time the
rest of the cluster _could_ have started the resource for the second
time.
You can reduce the possibility to almost nothing, but it's still not
nothing.
I certainly wouldn't ever stop encouraging people to configure and
use STONITH.
There are numerous good reasons not to use ssh stonith in
production. It is not reliable, only works in a development
environment,
Actually there are plenty of people using it today.
I'd much prefer they had a real device, but they are aware of the
risks and seem happy enough.
and IMHO can't be made reliable (I spent some time trying when I
wrote it), and relies on having ssh and at installed, ssh ports
open inbound and outbound, and the at daemon running.
These are all things one can verify beforehand. They're not reasons
to invent something new - after all, one can also misconfigure a real
stonith device.
Where ssh does have problems is when the node is sick, but it is no
less reliable than suicide in such situations.
It's just too fragile.
In fact, it's almost impossible to write a stonith of this form and
have it both work reliably and report on its success reliably.
After all, if it waits until it succeeds to report success, then
it's not there to do the reporting. This is why the current code
uses "at".
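The "schedule, then report" trick described above can be sketched like this. It is a hypothetical reconstruction, not the actual plugin code: the command strings and the injectable run callable are assumptions, and the demonstration captures the command instead of executing it.

```python
import subprocess

def ssh_reset(target, run=subprocess.run):
    # Queue the reboot on the target through at(1).  The ssh call
    # returns as soon as the job is accepted, so success can be
    # reported before the node dies; waiting for the reboot itself
    # would mean the reporting channel dies with it.
    remote = "echo 'reboot -f' | at now"
    return run(["ssh", f"root@{target}", remote], check=True)

# Demonstration that records the command instead of running it:
calls = []
ssh_reset("node2", run=lambda cmd, check: calls.append(cmd))
```

The injectable `run` parameter is only there so the sketch can be exercised without ssh access; the point is that the remote command is detached via at(1) rather than awaited.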
I don't believe that the ssh stonith approach is going to work.
If that were true then it wouldn't work for real stonith devices either.
It's no coincidence that you don't see the same problems when using
CTS with STONITH enabled.
In addition, your suggestion suffers from the "top of the stack"
reliability problem I mentioned in my previous email. The lower in
the stack that this happens, the fewer components are involved, and
the more reliable the result. The higher in the stack you try to
make this, the more things have to be working, and the less reliable
the result.
Um... you're bashing your own fencing subsystem here.
If it's reliable enough for all the other reasons it was created for,
then it is also reliable enough for this scenario.
With the advantage that there is a single code-path and zero new lines
of code.
Unless you're talking about the v1 resource manager, in which case
I'll completely butt out of the conversation.
Both your approach and mine are reasonably fail-fast. As a failure
recovery mechanism, however, recovering reliably is more important
than exactly how fast the code fails in these error cases. The
fewer things that have to work, the more reliable it is. Given how
many components have to work for the failure to be detected and
reported, a decision made, and actions queued up and carried out,
the recovery failure probabilities differ by several orders of
magnitude.
To put this in perspective, what we're arguing over is how to
implement method (a) from my previous reply to Kevin Tomlinson.
So, I don't hear you arguing for a general approach of (b), (c), or
(d).
Actually what I'm saying is that most of (d) already happens^ and that
the only "missing" piece is enabling STONITH so the cluster can do
something about it.
I'm also arguing that any implementation of (a) can be no more
reliable than simply adding an ssh STONITH agent and is inherently
less safe^^
^ The "propagate the failure to the top layer of the cluster" part
^^ because the subsystem put in charge of managing resources is
operating in parallel and without any idea of what's going on
_______________________________________________
Linux-HA mailing list
[email protected]
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems