On Nov 7, 2007, at 5:51 AM, Alan Robertson wrote:
Andrew Beekhof wrote:
On Nov 6, 2007, at 9:46 PM, Alan Robertson wrote:
Andrew Beekhof wrote:
On Nov 6, 2007, at 6:25 PM, Alan Robertson wrote:
We now have the ComponentFail test in CTS. Thanks Lars for
getting it going!
And, in the process, it's showing up some kinds of problems that
we hadn't been looking for before. A couple examples of such
problems can be found here:
http://old.linux-foundation.org/developer_bugzilla/show_bug.cgi?id=1762
It is very rare for a stonith action to be actually initiated in
this case.
But having stonith disabled results in very dangerous yet
unavoidable assumptions being made.
Which is why stonith is so highly encouraged.
http://old.linux-foundation.org/developer_bugzilla/show_bug.cgi?id=1732
The question that comes up is this:
For problems that should "never" happen like death of one of our
core/key processes, is an immediate reboot of the machine the
right recovery technique?
The advantages of such a choice include:
  - It is fast
  - It invokes recovery paths that we exercise a lot in testing
  - It is MUCH simpler than trying to recover from all these cases,
    and therefore almost certainly more reliable
The disadvantages of such a choice include:
  - It is crude, and very annoying
  - It probably shouldn't be invoked for single-node clusters (?)
  - It could be criticized as being lazy
  - It shouldn't be invoked if there is another simple and correct method
  - Continual rebooting becomes a possibility...
Assuming continual re-failure of one of our processes, yes.
We do not have a policy of doing this throughout the project;
what we have is a few places where we do it.
I propose that we should consider making a uniform policy
decision for the project - and specifically decide to use
ungraceful reboots as our recovery method for "key" processes
dying (for example: CCM, heartbeat, CIB, CRM). It should work
for those cases where people don't configure in watchdogs or
explicitly define any STONITH devices, and also independently of
quorum policies - because as far as I can tell it's the right
choice, and there's no technical reason not to do so.
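The proposed policy can be sketched as a tiny decision function. This is an illustration only, not Heartbeat's actual code; the process names and the "respawn" fallback are assumptions made for the example.

```python
# Hypothetical sketch of the proposed policy: when a key child process
# dies, choose an immediate ungraceful reboot rather than in-place
# recovery. Process names here are illustrative.
KEY_PROCESSES = {"ccm", "cib", "crmd", "heartbeat"}

def recovery_action(dead_process):
    """Return the recovery action chosen for a dead child process."""
    if dead_process in KEY_PROCESSES:
        # The real daemon would invoke something like reboot(2) or
        # exec /sbin/reboot -f here; we only report the decision.
        return "reboot"
    return "respawn"

print(recovery_action("cib"))   # key process: reboot the node
print(recovery_action("logd"))  # non-key process: just restart it
```

The point of the policy is exactly this simplicity: one uniform decision, taken low in the stack, regardless of quorum policy or STONITH configuration.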
My inclination is to think that this is a good approach to take
for problems that in our best-guess judgment "shouldn't happen".
I dislike it for the reason that node suicide provides a false
sense of security.
You end up making the window of opportunity for "something bad"
to happen smaller, but it still exists.
If you have STONITH configured, the two methods are equally safe.
I can't agree, STONITH is a clear winner.
You get more options in case of things like failed stops and a 100%
guarantee that resources will never be active on more than one node.
If you don't have STONITH configured, then my suggested approach
is significantly superior.
Superior to nothing isn't a tough ask.
Something you can't screw up has great merit.
The window for damage is very small - heartbeat is a realtime
process, and it is also the same process that is sending out the
"death of child" notices. Suitable adjustment of event priorities
could eliminate the window of possibility in the "don't have
stonith-configured" case.
You can't eliminate it.
The death of the machine is not instantaneous and in that time the
rest of the cluster _could_ have started the resource for the
second time.
I think you're missing something here...
The way the other nodes are notified when a process dies - so that
they can do something about it - is that Heartbeat tells them.
It won't tell them about death-of-a-client if it suicides first. If
Heartbeat doesn't tell them the process has died, then they won't
know. If heartbeat processes death-of-child before it processes the
close of the socket (which we have control over), then there is no
window of opportunity for it to fail.
True.
This is only for the case where we know something has failed
ourselves. Not for the cases where the node doesn't know something
is wrong. In the cases where we know something is wrong, we always
get to choose how to notify others (if at all).
You can reduce the possibility to almost nothing, but it's still not
nothing.
I don't know of any holes in my description above. If you have a
specific counter-example, please present it.
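The ordering argument above can be illustrated with a toy priority queue. This is a sketch only, not heartbeat's actual event loop; the priority values and event names are invented for the example.

```python
import heapq

# If "death of child" is dequeued at a higher priority than
# "client socket closed", the suicide path always runs before any
# notification to peers can be sent, closing the window described above.
PRIO_DEATH_OF_CHILD = 0   # handled first: triggers the immediate reboot
PRIO_SOCKET_CLOSED = 1    # handled later: would notify peer nodes

queue = []
# Push in "wrong" arrival order to show priority, not arrival, decides.
heapq.heappush(queue, (PRIO_SOCKET_CLOSED, "notify-peers"))
heapq.heappush(queue, (PRIO_DEATH_OF_CHILD, "suicide-reboot"))

first = heapq.heappop(queue)[1]
print(first)  # the suicide event is processed first
```

Whether this fully closes the window in practice depends on the real event loop honoring such priorities, which is exactly the point under debate.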
I certainly wouldn't ever stop encouraging people to configure and
use STONITH.
There are numerous good reasons not to use ssh stonith in
production. It is not reliable, only works in a development
environment,
Actually there are plenty of people using it today.
I'd much prefer they had a real device, but they are aware of the
risks and seem happy enough.
Like lots of cases, people are happy enough until they get burned.
Just like the people with shared storage and no STONITH who are
happy enough until they get bit. And, there are plenty of those
folks too - probably quite a few more.
and IMHO can't be made reliable (I spent some time trying when I
wrote it). It also relies on having ssh and at installed, ssh ports
open inbound and outbound, and the at daemon running.
These are all things one can verify beforehand. They're not
reasons to invent something new - after all, one can also
misconfigure a real stonith device.
And they are requirements which we don't have to impose on anyone to
recover from this kind of error. So, the complexity in
configuration is high - and the chances for screwup are high. Since
this won't be used in anything except the most limited of
circumstances, the chances of a configuration error going undetected
are high.
It's not that high - the howto guide wouldn't be very long or
complicated, and there are easy steps one can take to verify that
it's set up correctly.
Where ssh does have problems is when the node is sick, but it is no
less reliable than suicide in such situations.
There are about 10 different components on two different machines
that have to work right
It already has to work right in a whole multitude of cases besides
this scenario.
It already _does_ work in this scenario and I wrote the ComponentFail
test to make sure it continues to work.
- and if it's the DC that's got the problem, it probably won't work
at all - it looks like Lars may have rigged it to avoid that case in
the test code.
It will work and no, he hasn't rigged it.
In the suicide case, there is exactly one system call that has to
work right. And, in the end, the STONITH will still get used if
it's been configured. And, if it hasn't, things still work.
It's just too fragile.
In fact, it's almost impossible to write a stonith of this form
and have it both work reliably and report on its success
reliably. After all, if it waits until it succeeds to report
success, then it's not there to do the reporting. This is why the
current code uses "at".
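That design can be sketched roughly as follows. This is an illustration only; the function name, ssh flags, and remote command are assumptions for the example, not the actual plugin code.

```python
import shlex

# Illustrative sketch of the ssh STONITH reset path: the reboot is
# scheduled through at(1) on the target, so the ssh session can return
# (and success can be reported) before the node actually goes down.
def build_reset_command(target, user="root"):
    # 'echo ... | at now' queues the reboot asynchronously on the target
    remote = "echo 'reboot -f' | at now"
    return ["ssh", "-q", "-x", f"{user}@{target}", remote]

print(shlex.join(build_reset_command("node2")))
```

The deferral through at(1) is the workaround for the report-your-own-death problem described above: a command that waited for the reboot to complete could never come back to say it succeeded.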
I don't believe that the ssh stonith approach is going to work.
If that were true then it wouldn't work for real stonith devices
either.
It's no coincidence that you don't see the same problems when using
CTS with stonith enabled.
Because we faked up the ssh stonith code to make some guesses that
are suitable for testing,
No.
If the ssh plugin wasn't helping then you'd also see "resource active
on multiple hosts" errors when stonith was enabled.
but not really what you'd call wonderful.
In addition, your suggestion suffers from the "top of the stack"
reliability problem I mentioned in my previous email. The lower
in the stack that this happens, the fewer components are involved,
and the more reliable the result. The higher in the stack you try
and make this, the more things have to be working, and the less
reliable the result.
Um... you're bashing your own fencing subsystem here.
If it's reliable enough for all the other reasons it was created
for, then it is also reliable for this scenario.
No, I'm simply saying: simpler beats more complex.
And it's not just the stonith subsystem, it's Heartbeat, stonith,
the CRM, the tengine, the pengine, etc. Which one(s) do you
postulate has the error(s)? On the DC? Not on the DC?
huh?
This is in addition to the recovery mechanism you like. It doesn't
get rid of it - it is a very small amount of code which provides a
second method for recovering.
With the advantage that there is a single code-path and zero new
lines of code.
Not true. It doesn't work now, unless you have STONITH enabled.
Of course it doesn't work if stonith is enabled. That's not the
alternative anyone is pushing here.
And, there are cases where STONITH isn't available - or may fail.
And, the code (which is now finished) is a very small amount of very
simple code - with a much shorter execution path.
Ok, so we're having a discussion but the decision has already been
made. Fantastic.
Which processes is it enabled for? Can it at least be turned off when
stonith is enabled?
Unless you're talking about the v1 resource manager, in which case
I'll completely butt out of the conversation.
Both your approach and mine are reasonably fail-fast. As a
failure recovery mechanism however, recovering reliably is more
important than exactly how fast the code fails in these error
cases. The fewer things that have to work the more reliable it
is. Given how many components have to work for the failure to be
detected and reported, the decision made, and the actions queued up
and carried out, the recovery failure probabilities differ by
several orders of magnitude.
To put this in perspective, what we're arguing over is how to
implement method (a) from my previous reply to Kevin Tomlinson.
So, I don't hear you arguing for a general approach of (b), (c),
or (d).
Actually what I'm saying is that most of (d) already happens^ and
that the only "missing" piece is enabling STONITH so the cluster
can do something about it.
No, (d) stops anything from happening - including STONITH - and the
huge number of race conditions involved in all the different ways it
can happen boggles the mind.
This is just wrong.
I'm also arguing that any implementation of (a) can be no more
reliable than simply adding an ssh STONITH agent and is inherently
less safe^^
If you think it's inherently less safe,
I agreed above with your explanation. If I think of anything I'll be
sure to mention it.
then please give a specific example to support this statement. I
don't understand how having two protection mechanisms is somehow
less safe than having only one of the two. Actually it's
one protection mechanism which doesn't require configuration, versus
one protection mechanism which works only when properly configured.
^ The "propagate the failure to the top layer of the cluster" part
^^ because the subsystem put in charge of managing resources is
operating in parallel and without any idea of what's going on
The failure gets propagated to the top of the stack after the
reboot. Then it gets recovered from by the CRM, et al. The reboot
certainly doesn't do the recovery.
This doesn't get rid of your recovery code. It just notifies it to
do the recovery in a slightly different way.
Of course, you always have the option of recovering yourself without
requiring STONITH. That's method (b) and it's a great method. Feel
free to avail yourself of it. This is what Lars wants me to do in
the comm case. It's clearly the best choice - far better than any
of the others.
With built-in reboot:
  - with correctly configured stonith: it works
  - without stonith: it works
Without built-in reboot:
  - with correctly configured stonith: it works
  - without correctly configured stonith: it destroys data
Please stop suggesting that _anyone_ is promoting this.
I opt to protect the data.
Really? Personally I'd much prefer to see everyone's data burn in
the fires of hell. </sarcasm>
_______________________________________________
Linux-HA mailing list
[email protected]
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems