Andrew Beekhof wrote:

On Nov 6, 2007, at 9:46 PM, Alan Robertson wrote:

Andrew Beekhof wrote:
On Nov 6, 2007, at 6:25 PM, Alan Robertson wrote:
We now have the ComponentFail test in CTS. Thanks Lars for getting it going!

And, in the process, it's showing up some kinds of problems that we hadn't been looking for before. A couple examples of such problems can be found here:

http://old.linux-foundation.org/developer_bugzilla/show_bug.cgi?id=1762
It is very rare for a stonith action to be actually initiated in this case. But having stonith disabled results in very dangerous yet unavoidable assumptions being made.
Which is why stonith is so highly encouraged.

http://old.linux-foundation.org/developer_bugzilla/show_bug.cgi?id=1732

The question that comes up is this:

For problems that should "never" happen like death of one of our core/key processes, is an immediate reboot of the machine the right recovery technique?

The advantages of such a choice include:
It is fast
It will invoke recovery paths that we exercise a lot in testing
It is MUCH simpler than trying to recover from all these cases,
   therefore almost certainly more reliable

The disadvantages of such a choice include:
It is crude, and very annoying
It probably shouldn't be invoked for single-node clusters (?)
It could be criticized as being lazy
It shouldn't be invoked if there is another simple and correct method

Continual rebooting becomes a possibility...
Assuming continual re-failure of one of our processes, yes.
We do not have a policy of doing this throughout the project; what we have are a few places where we do it.

I propose that we should consider making a uniform policy decision for the project - and specifically decide to use ungraceful reboots as our recovery method when a "key" process dies (for example: CCM, heartbeat, CIB, CRM). It should work for those cases where people don't configure watchdogs or explicitly define any STONITH devices, and also independently of quorum policies. As far as I know there's no technical reason not to do so, and my inclination is to think that this is a good approach for problems that in our best-guess judgment "shouldn't happen".
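(A minimal sketch of the proposed policy, for illustration only - the process names and the decision function are hypothetical, not the actual heartbeat source:)

```python
# Hypothetical sketch: a parent process (think heartbeat) watches its
# children and, when a "key" child dies, fails fast with an ungraceful
# reboot instead of attempting in-place recovery.  On Linux the reboot
# itself would be a single system call, e.g. reboot(RB_AUTOBOOT).

KEY_PROCESSES = {"ccm", "cib", "crmd"}   # illustrative set of core processes

def on_child_death(name, single_node_cluster=False):
    """Decide the recovery action when a child process dies."""
    if name not in KEY_PROCESSES:
        return "respawn"       # ordinary client: just restart it
    if single_node_cluster:
        return "respawn"       # no peer to take over; a reboot gains little
    # Key process on a multi-node cluster: notify the peers, then reboot.
    return "reboot"
```

The design choice being argued for is exactly the small surface area of this decision: one predicate and one system call, rather than a multi-component fencing path.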
I dislike it for the reason that node suicide provides a false sense of security. You end up making the window of opportunity for "something bad" to happen smaller, but it still exists.

If you have STONITH configured, the two methods are equally safe.

I can't agree; STONITH is a clear winner.
You get more options in case of things like failed stops and a 100% guarantee that resources will never be active on more than one node.

If you don't have STONITH configured, then my suggested approach is significantly superior.

Superior to nothing isn't a tough ask.

Something you can't screw up has great merit.

The window for damage is very small - heartbeat is a realtime process, and it is also the same process that is sending out the "death of child" notices. Suitable adjustment of event priorities could eliminate the window of possibility in the "don't have stonith-configured" case.

You can't eliminate it.
The death of the machine is not instantaneous and in that time the rest of the cluster _could_ have started the resource for the second time.

I think you're missing something here...

The way the other nodes are notified when a process dies - so that they can do something about it - is that Heartbeat tells them.

It won't tell them about death-of-a-client if it suicides first. If Heartbeat doesn't tell them the process has died, then they won't know. If heartbeat processes death-of-child before it processes the close of the socket (which we have control over), then there is no window of opportunity for it to fail.

This is only for the case where we know something has failed ourselves. Not for the cases where the node doesn't know something is wrong. In the cases where we know something is wrong, we always get to choose how to notify others (if at all).
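(The ordering argument above can be sketched with a toy prioritized event loop - this is not heartbeat's actual dispatcher, just an illustration of the claim that handling death-of-child ahead of socket-close closes the window:)

```python
import heapq

# Toy event dispatcher: death-of-child events get a higher priority
# (lower number) than socket-close events, so the "process X died"
# broadcast to the peers always happens before local cleanup.

PRIORITY = {"child_death": 0, "socket_close": 1}   # lower = handled first

def dispatch(events):
    """Handle queued (kind, arg) events in priority order."""
    heap = [(PRIORITY[kind], seq, kind, arg)
            for seq, (kind, arg) in enumerate(events)]
    heapq.heapify(heap)
    handled = []
    while heap:
        _, _, kind, arg = heapq.heappop(heap)
        if kind == "child_death":
            handled.append(("broadcast_death", arg))   # tell the peers first
        else:
            handled.append(("close", arg))
    return handled

# Even if the socket close was queued first, the death notice goes out first:
order = dispatch([("socket_close", "cib"), ("child_death", "cib")])
```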

You can reduce the possibility to almost nothing, but it's still not nothing.

I don't know of any holes in my description above. If you have a specific counter-example, please present it.

I certainly wouldn't ever stop encouraging people to configure and use STONITH.

There are numerous good reasons not to use ssh stonith in production. It is not reliable, only works in a development environment,

Actually there are plenty of people using it today.
I'd much prefer they had a real device, but they are aware of the risks and seem happy enough.

Like lots of cases, people are happy enough until they get burned. Just like the people with shared storage and no STONITH who are happy enough until they get bit. And, there are plenty of those folks too - probably quite a few more.

and IMHO can't be made reliable (I spent some time trying when I wrote it), and relies on having ssh and at installed, ssh ports open inbound and outbound, and atd running.

These are all things one can verify beforehand. They're not reasons to invent something new - after all, one can also misconfigure a real stonith device.

And they are requirements which we don't have to impose on anyone to recover from this kind of error. So, the complexity in configuration is high - and the chances for screwup are high. Since this won't be used in anything except the most limited of circumstances, the chances of a configuration error going undetected are high.

Where ssh does have problems is when the node is sick, but it is no less reliable than suicide in such situations.

There are about 10 different components on two different machines that have to work right - and if it's the DC that's got the problem, it probably won't work at all - it looks like Lars may have rigged it to avoid that case in the test code. In the suicide case, there is exactly one system call that has to work right. And, in the end, the STONITH will still get used if it's been configured. And, if it hasn't, things still work.

It's just too fragile.

In fact, it's almost impossible to write a stonith of this form and have it both work reliably and report on its success reliably. After all, if it waits until it succeeds to report success, then it's not there to do the reporting. This is why the current code uses "at".
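(To make the "at" point concrete, here is roughly what such an agent has to do - the exact command is illustrative, not the plugin's real source:)

```python
def ssh_reset_command(node, user="root"):
    """Build an ssh invocation that queues an immediate reboot via at(1).

    Queuing through `at` lets the remote shell (and the ssh session)
    exit cleanly before the reboot fires, so the stonith plugin can
    report that the command was accepted.  The flip side is exactly the
    problem described above: "accepted" is all it can ever report; it
    cannot confirm the node actually died.
    """
    remote = "echo 'reboot -nf' | at now"
    return ["ssh", f"{user}@{node}", remote]
```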

I don't believe that the ssh stonith approach is going to work.

If that were true then it wouldn't work for real stonith devices either.
It's no coincidence that you don't see the same problems when using CTS with stonith enabled.

Because we faked up the ssh stonith code to make some guesses that are suitable for testing, but not really what you'd call wonderful.

In addition, your suggestion suffers from the "top of the stack" reliability problem I mentioned in my previous email. The lower in the stack that this happens, the fewer components are involved, and the more reliable the result. The higher in the stack you try and make this, the more things have to be working, and the less reliable the result.

Um... you're bashing your own fencing subsystem here.
If it's reliable enough for all the other reasons it was created for, then it is also reliable for this scenario.

No, I'm simply saying: simpler beats more complex. And it's not just the stonith subsystem, it's Heartbeat, stonith, the CRM, the tengine, the pengine, etc. Which one(s) do you postulate has the error(s)? On the DC? Not on the DC?

This is in addition to the recovery mechanism you like. It doesn't get rid of it - it is a very small amount of code which provides a second method for recovering.

With the advantage that there is a single code-path and zero new lines of code.

Not true. It doesn't work now, unless you have STONITH enabled. And, there are cases where STONITH isn't available - or may fail. And, the code (which is now finished) is a very small amount of very simple code - with a much shorter execution path.

Unless you're talking about the v1 resource manager, in which case I'll completely butt out of the conversation.

Both your approach and mine are reasonably fail-fast. As a failure recovery mechanism, however, recovering reliably is more important than exactly how fast the code fails in these error cases. The fewer things that have to work, the more reliable it is. Given how many components have to work for the failure to be detected, reported, a decision made, and actions queued up and carried out, the recovery failure probabilities differ by several orders of magnitude.
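(Back-of-the-envelope arithmetic only - the 0.999 and 0.999999 figures are assumptions for illustration, not measured numbers: if each component in the detect/report/decide/fence chain independently works with probability p, a path through n components succeeds with p**n.)

```python
# Failure probability of a recovery path with n independent components,
# each working with probability p_component.  Purely illustrative numbers.

def failure_probability(p_component, n_components):
    return 1 - p_component ** n_components

ten_component_path = failure_probability(0.999, 10)    # fencing-style path
single_syscall     = failure_probability(0.999999, 1)  # suicide-style path
```

Under these assumed numbers the ten-component path fails roughly ten thousand times more often than the single system call, which is the shape of the "orders of magnitude" claim.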

To put this in perspective, what we're arguing over is how to implement method (a) from my previous reply to Kevin Tomlinson.

So, I don't hear you arguing for a general approach of (b), (c), or (d).

Actually what I'm saying is that most of (d) already happens^ and that the only "missing" piece is enabling STONITH so the cluster can do something about it.

No, (d) stops anything from happening - including STONITH - and the huge number of race conditions that are involved in all the different ways it can happen boggle the mind.

I'm also arguing that any implementation of (a) can be no more reliable than simply adding an ssh STONITH agent and is inherently less safe^^

If you think it's inherently less safe, then please give a specific example to support this statement. I don't understand how having two protection mechanisms is somehow less safe than having only one of the two. Actually it's:
        one protection mechanism which doesn't require configuration, plus
        one protection mechanism which works when properly configured

^ The "propagate the failure to the top layer of the cluster" part
^^ because the subsystem put in charge of managing resources is operating in parallel and without any idea of what's going on

The failure gets propagated to the top of the stack after the reboot. Then it gets recovered from by the CRM, et al. The reboot certainly doesn't do the recovery.

This doesn't get rid of your recovery code. It just notifies it to do the recovery in a slightly different way.

Of course, you always have the option of recovering yourself without requiring STONITH. That's method (b) and it's a great method. Feel free to avail yourself of it. This is what Lars wants me to do in the comm case. It's clearly the best choice - far better than any of the others.

With built-in reboot:
        with correctly configured stonith - it works
        without stonith - it works

Without built-in reboot:
        with correctly configured stonith - it works
        without correctly configured stonith - it destroys data

I opt to protect the data.

--
    Alan Robertson <[EMAIL PROTECTED]>

"Openness is the foundation and preservative of friendship... Let me claim from you at all times your undisguised opinions." - William Wilberforce
_______________________________________________
Linux-HA mailing list
[email protected]
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems
