On Tue, 2007-11-06 at 13:46 -0700, Alan Robertson wrote:
> Andrew Beekhof wrote:
> >
> > On Nov 6, 2007, at 6:25 PM, Alan Robertson wrote:
> >
> >> We now have the ComponentFail test in CTS. Thanks Lars for getting
> >> it going!
> >>
> >> And, in the process, it's showing up some kinds of problems that we
> >> hadn't been looking for before. A couple of examples of such
> >> problems can be found here:
> >>
> >> http://old.linux-foundation.org/developer_bugzilla/show_bug.cgi?id=1762
> >
> > It is very rare for a stonith action to actually be initiated in
> > this case. But having stonith disabled results in very dangerous yet
> > unavoidable assumptions being made.
> >
> > Which is why stonith is so highly encouraged.
> >
> >> http://old.linux-foundation.org/developer_bugzilla/show_bug.cgi?id=1732
> >>
> >> The question that comes up is this:
> >>
> >> For problems that should "never" happen, like the death of one of
> >> our core/key processes, is an immediate reboot of the machine the
> >> right recovery technique?
> >>
> >> The advantages of such a choice include:
> >>   - It is fast
> >>   - It will invoke recovery paths that we exercise a lot in testing
> >>   - It is MUCH simpler than trying to recover from all these cases,
> >>     and therefore almost certainly more reliable
> >>
> >> The disadvantages of such a choice include:
> >>   - It is crude, and very annoying
> >>   - It probably shouldn't be invoked for single-node clusters (?)
> >>   - It could be criticized as being lazy
> >>   - It shouldn't be invoked if there is another simple and correct
> >>     method
> >>
> >> Continual rebooting becomes a possibility...
> >
> > Assuming continual re-failure of one of our processes, yes.
> >
> >> We do not have a policy of doing this throughout the project; what
> >> we have is a few places where we do it.
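The reboot-on-death behaviour being proposed can be sketched in a few
lines of shell (a minimal illustration, not heartbeat's actual code;
REBOOT_CMD is a dry-run stand-in for something like /sbin/reboot -nf):

```shell
#!/bin/sh
# Fail-fast sketch: treat the death of a critical cluster process as
# fatal and reboot, rather than attempt in-place recovery.
# REBOOT_CMD is a placeholder so the sketch is safe to run; real code
# would force a reboot (e.g. reboot(2) from C, or /sbin/reboot -nf).
REBOOT_CMD=${REBOOT_CMD:-"echo would-reboot"}

watch_pid() {
    # kill -0 delivers no signal; it only tests that the process exists
    if kill -0 "$1" 2>/dev/null; then
        echo "process $1 alive"
    else
        echo "critical process $1 died - failing fast"
        $REBOOT_CMD
        return 1
    fi
}

# A real watchdog would poll (or trap SIGCHLD) in a loop, e.g.:
#   while watch_pid "$CRITICAL_PID"; do sleep 1; done
```

The appeal of this shape is exactly the one listed above: almost
nothing has to be working for the recovery to fire.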
> >>
> >> I propose that we should consider making a uniform policy decision
> >> for the project - and specifically decide to use ungraceful reboots
> >> as our recovery method for "key" processes dying (for example: CCM,
> >> heartbeat, CIB, CRM). It should work for those cases where people
> >> don't configure in watchdogs or explicitly define any STONITH
> >> devices, and also independently of quorum policies - because AFAIK
> >> it seems like the right choice, and there's no technical reason not
> >> to do so.
> >>
> >> My inclination is to think that this is a good approach to take for
> >> problems that in our best-guess judgment "shouldn't happen".
> >
> > I dislike it for the reason that node suicide provides a false sense
> > of security. You end up making the window of opportunity for
> > "something bad" to happen smaller, but it still exists.
>
> If you have STONITH configured, the two methods are equally safe. If
> you don't have STONITH configured, then my suggested approach is
> significantly superior. The window for damage is very small -
> heartbeat is a realtime process, and it is also the same process that
> is sending out the "death of child" notices. Suitable adjustment of
> event priorities could eliminate the window of possibility in the
> "don't have stonith configured" case.
>
> I certainly wouldn't ever stop encouraging people to configure and
> use STONITH.
>
> There are numerous good reasons not to use ssh stonith in production.
> It is not reliable, it only works in a development environment, and
> IMHO it can't be made reliable (I spent some time trying when I wrote
> it). It also relies on having ssh and at installed, ssh ports open
> inbound and outbound, and "at" running. It's just too fragile.
>
> In fact, it's almost impossible to write a stonith of this form and
> have it both work reliably and report on its success reliably. After
> all, if it waits until it succeeds to report success, then it's not
> there to do the reporting.
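The reporting paradox Alan describes can be made concrete with a
sketch (hypothetical commands, not the actual ssh STONITH plugin; SSH
defaults to a dry-run echo so nothing is executed remotely):

```shell
#!/bin/sh
# Sketch of the self-reporting problem in an ssh-based STONITH
# (illustrative only - not the real plugin's code).
# SSH is a dry-run stand-in so the sketch is safe to run.
SSH=${SSH:-"echo ssh"}

stonith_ssh() {
    # Running "reboot" directly over ssh kills the very session that
    # would report success.  Queueing it through at(1) lets ssh return
    # immediately - but then "success" only means "reboot queued",
    # not "node is dead".
    $SSH "root@$1" "echo /sbin/reboot -nf | at now"
}
```

Any plugin of this shape must choose between staying connected to
verify the kill (and then being unable to report it) or returning
early via at(1) (and then reporting only that the reboot was queued).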
> This is why the current code uses "at".
>
> I don't believe that the ssh stonith approach is going to work.
>
> In addition, your suggestion suffers from the "top of the stack"
> reliability problem I mentioned in my previous email. The lower in
> the stack this happens, the fewer components are involved, and the
> more reliable the result. The higher in the stack you try to make
> this happen, the more things have to be working, and the less
> reliable the result.
>
> Both your approach and mine are reasonably fail-fast. As a failure
> recovery mechanism, however, recovering reliably is more important
> than exactly how fast the code fails in these error cases. The fewer
> things that have to work, the more reliable it is. Given how many
> components have to work for the failure to be detected and reported,
> the decision made, and the actions queued up and carried out, the
> recovery failure probabilities of the two approaches differ by
> several orders of magnitude.
>
> To put this in perspective, what we're arguing over is how to
> implement method (a) from my previous reply to Kevin Tomlinson.
>
> So, I don't hear you arguing for a general approach of (b), (c), or
> (d).
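The component-count argument above is multiplicative, which a little
arithmetic makes concrete (the per-step reliability below is invented
purely for illustration):

```shell
#!/bin/sh
# Illustrative arithmetic only - the numbers are invented, not
# measured.  A recovery path fails if ANY of its components fails, so
# its failure probability grows with the number of components that
# must all work.
recovery_failure() {
    # $1 = components in the path, $2 = per-component reliability
    awk -v n="$1" -v p="$2" 'BEGIN { printf "%.6f", 1 - p^n }'
}

# one low-level watcher vs. a hypothetical five-step
# detect/report/decide/queue/execute chain, each step 99.99% reliable:
echo "low-level  path failure: $(recovery_failure 1 0.9999)"
echo "high-level path failure: $(recovery_failure 5 0.9999)"
```

With equal per-step numbers the gap is a factor of the path length;
how large it is in practice depends on the real per-component
reliabilities, which this sketch does not claim to know.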
I know I don't fully understand all the complexities and issues around
"unexpected bad things". I'm just adding my two pennies for all your
developers: if it's an internal cluster issue on a node, and not the
node's actual resources that have had a problem (maybe you can't tell
the difference easily, as I can see from the discussions), then I, as
an administrator and service provider, would want the behaviour that
keeps my resources running, to minimize service outages. I know I
don't fully see the complications here, but any attempt to keep
service running, however complicated the code gets, is sometimes worth
it over service failure and node reboots.

_______________________________________________
Linux-HA mailing list
[email protected]
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems
