On Nov 6, 2007, at 7:45 PM, Alan Robertson wrote:
> Kevin Tomlinson wrote:
>> On Tue, 2007-11-06 at 10:25 -0700, Alan Robertson wrote:
>>> We now have the ComponentFail test in CTS. Thanks Lars for
>>> getting it going!
Actually, that was me. Try "hg annotate".
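(For anyone following along: "hg annotate" prints each line of a file
together with the changeset and user that last modified it. For
example, from a checkout of the tree -- the file name here is just
for illustration:

    $ hg annotate -u cts/CTStests.py

shows who last touched each line of the test code.)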
>>> And, in the process, it's showing up some kinds of problems that
>>> we hadn't been looking for before. A couple of examples of such
>>> problems can be found here:
>>> http://old.linux-foundation.org/developer_bugzilla/show_bug.cgi?id=1762
>>> http://old.linux-foundation.org/developer_bugzilla/show_bug.cgi?id=1732
>>> The question that comes up is this:
>>>
>>> For problems that should "never" happen, like the death of one of
>>> our core/key processes, is an immediate reboot of the machine the
>>> right recovery technique?
>>> The advantages of such a choice include:
>>>   - It is fast.
>>>   - It will invoke recovery paths that we exercise a lot in
>>>     testing.
>>>   - It is MUCH simpler than trying to recover from all these
>>>     cases, therefore almost certainly more reliable.
>>>
>>> The disadvantages of such a choice include:
>>>   - It is crude, and very annoying.
>>>   - It probably shouldn't be invoked for single-node clusters (?).
>>>   - It could be criticized as being lazy.
>>>   - It shouldn't be invoked if there is another simple and
>>>     correct method.
>>>   - Continual rebooting becomes a possibility...
>>> We do not have a policy of doing this throughout the project;
>>> what we have is a few places where we do it.
>>>
>>> I propose that we should consider making a uniform policy
>>> decision for the project - and specifically decide to use
>>> ungraceful reboots as our recovery method for "key" processes
>>> dying (for example: CCM, heartbeat, CIB, CRM). It should work
>>> for those cases where people don't configure in watchdogs or
>>> explicitly define any STONITH devices, and also independently of
>>> quorum policies - because, AFAIK, it seems like the right choice
>>> and there's no technical reason not to do so.
>>>
>>> My inclination is to think that this is a good approach to take
>>> for problems that in our best-guess judgment "shouldn't happen".
>>> I'm bringing this to both lists, so that we can hear comments
>>> from both developers and users.
>>>
>>> Comments please...
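Just to make the tradeoff concrete: the "reboot on key-process
death" policy is very little code, which is part of its appeal. A
minimal sketch in C (a hypothetical stand-alone watcher with a
made-up binary path, not the actual heartbeat implementation):

    #include <stdio.h>
    #include <unistd.h>
    #include <sys/types.h>
    #include <sys/wait.h>
    #include <sys/reboot.h>

    /* Hypothetical watcher: fork one key child (here a stand-in
     * path for the CIB), then wait for it.  If it ever exits, for
     * any reason, don't try to be clever: flush the disks and take
     * the node down hard, so that the well-tested node-failure
     * recovery paths kick in. */
    int main(void)
    {
        pid_t child = fork();

        if (child < 0)
            return 1;               /* fork failed; nothing to watch */
        if (child == 0) {
            execl("/usr/lib/heartbeat/cib", "cib", (char *)NULL);
            _exit(1);               /* exec failed: treat as death */
        }

        int status;
        waitpid(child, &status, 0); /* blocks until the child dies */

        fprintf(stderr, "key process died (status 0x%x): rebooting\n",
                status);
        sync();                     /* flush what we safely can */
        reboot(RB_AUTOBOOT);        /* hard reboot; requires root */
        return 1;                   /* only reached if reboot() failed */
    }

A hardware watchdog (or softdog) timer is the obvious backstop for
the case where even this reboot path is wedged.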
>> I would say the "right thing" would depend on your cluster
>> implementation and what is considered the right thing to do for
>> the applications that the cluster is monitoring.
>>
>> I would propose that this action should be administrator
>> configurable.
>> From a user point of view, with the cluster that we are
>> implementing we would expect any cluster failure (internal) to
>> either get itself back up and running or just send out an alert:
>> "Help me, I'm not working"... as we would want our applications
>> to continue running on the nodes.  ** We don't want a service
>> outage just because the cluster is no longer monitoring our
>> applications. **
>>
>> We would expect to get a 24x7 call-out (Sev1), then log on to the
>> cluster and see what was happening. (Configured alerting.)
>>
>> Our applications only want a service outage if the node itself
>> has issues, not the cluster.
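For what it's worth, the sort of knob Kevin seems to be asking for
might look like this. To be explicit: this is purely hypothetical
ha.cf syntax for illustration; no such directive exists today:

    # HYPOTHETICAL ha.cf syntax -- illustration only, not implemented.
    # Per-component recovery action, chosen by the administrator:
    #   reboot     - take the node down hard (Alan's proposal)
    #   alert-only - leave resources running and page a human
    on_component_failure cib  reboot
    on_component_failure ccm  alert-only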
> Here's the issue:
>
> The solution as I see it is to do one of:
>   a) reboot the node and clear the problem with certainty
>   b) continue on and risk damaging your disks
>   c) write some new code to recover from specific cases more
>      gracefully, and then test it thoroughly
>   d) try to figure out how to propagate the failure to the
>      top layer of the cluster, and hope you get the notice
>      there soon enough that it can "freeze" the cluster
>      before the code reacts to the apparent failure
>      and begins to try to recover from it
>
> In the current code, sometimes you'll get behavior (a) and
> sometimes you'll get behavior (b) and sometimes you'll get
> behavior (c).
>
> In the particular case described by bug 1762, failure to reboot
> the node did indeed start the same resource twice.
As stated in my previous reply, rebooting doesn't prevent the
resource from being started twice; it just makes it less likely -
which is definitely not the same thing. The node can only reboot
itself after it has noticed the failure, and in that window the
rest of the cluster may already have started the resource elsewhere
while the local copy is still running.
> In a cluster where you have shared disk (like yours, for example),
> that would probably trash the filesystem. Not a good plan unless
> you're tired of your current job ;-). I'd like to take most/all of
> the cases where you might get behavior (b) and cause them to use
> behavior (a).
>
> If writing correct code and testing it were free, then (c) would
> obviously be the right choice.
>
> Quite honestly, I don't know how to do (d) in a reliable way at
> all. It's much more difficult than it sounds. Among other reasons,
> it relies on the components you're telling to freeze things to
> work correctly. Since resource freezes happen at the top level of
> the system, and the top layers need all the layers under them to
> work correctly, getting this right seems to be the kind of
> approach you could make into your life's work - and still never
> get it right.
>
> Case (c) has to be handled on a case-by-case basis, where you
> write and test the code for a particular failure case. IMHO the
> only feasible _general_ answer is (a).
You forgot:
  e) configure STONITH - even the ssh or meatware plugins are
     better than nothing
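For completeness: with the v2 CRM a STONITH device is just a
stonith-class resource, typically cloned across the nodes. Roughly
the fragment below -- the parameter name is from memory, so check
"stonith -t external/ssh -n" for the plugin's actual parameter list:

    <clone id="DoFencing">
      <primitive id="child_DoFencing" class="stonith"
                 type="external/ssh">
        <instance_attributes>
          <attributes>
            <nvpair id="fence-hostlist" name="hostlist"
                    value="node1 node2"/>
          </attributes>
        </instance_attributes>
      </primitive>
    </clone>

The point being: once any STONITH device is configured, a healthy
node can shoot the one whose key processes have died, instead of
hoping the sick node notices and reboots itself.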
> There are an infinite number of things that can go wrong. So,
> having a reliable and general strategy to deal with the WTFs of
> the world is a good thing. Of course, those cases where we have a
> (c) behavior would not be affected by this change in general
> policy.
> --
> Alan Robertson <[EMAIL PROTECTED]>
>
> "Openness is the foundation and preservative of friendship... Let
> me claim from you at all times your undisguised opinions."
>   - William Wilberforce
_______________________________________________
Linux-HA mailing list
[email protected]
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems