On Tue, 2007-11-06 at 10:25 -0700, Alan Robertson wrote:
> We now have the ComponentFail test in CTS. Thanks Lars for getting it
> going!
>
> And, in the process, it's showing up some kinds of problems that we
> hadn't been looking for before. A couple examples of such problems can
> be found here:
>
> http://old.linux-foundation.org/developer_bugzilla/show_bug.cgi?id=1762
> http://old.linux-foundation.org/developer_bugzilla/show_bug.cgi?id=1732
>
> The question that comes up is this:
>
> For problems that should "never" happen, like death of one of our
> core/key processes, is an immediate reboot of the machine the right
> recovery technique?
>
> The advantages of such a choice include:
>   It is fast
>   It will invoke recovery paths that we exercise a lot in testing
>   It is MUCH simpler than trying to recover from all these cases,
>     therefore almost certainly more reliable
>
> The disadvantages of such a choice include:
>   It is crude, and very annoying
>   It probably shouldn't be invoked for single-node clusters (?)
>   It could be criticized as being lazy
>   It shouldn't be invoked if there is another simple and correct method
>   Continual rebooting becomes a possibility...
>
> We do not have a policy of doing this throughout the project; what we
> have is a few places where we do it.
>
> I propose that we should consider making a uniform policy decision for
> the project - and specifically decide to use ungraceful reboots as our
> recovery method for "key" processes dying (for example: CCM, heartbeat,
> CIB, CRM). It should work for those cases where people don't configure
> in watchdogs or explicitly define any STONITH devices, and also
> independently of quorum policies - because AFAIK it seems like the right
> choice and there's no technical reason not to do so.
>
> My inclination is to think that this is a good approach to take for
> problems that in our best-guess judgment "shouldn't happen".
>
> I'm bringing this to both lists, so that we can hear comments both from
> developers and users.
>
> Comments please...
I would say the "right thing" would depend on your cluster implementation
and on what is considered the right thing to do for the applications that
the cluster is monitoring. I would propose that this action should be
administrator configurable.

From a user point of view, with the cluster we are implementing we would
expect any internal cluster failure to either get itself back up and running
or just send out an alert: "Help me, I'm not working"... as we would want
our applications to continue running on the nodes.

** We don't want a service outage just because the cluster is no longer
monitoring our applications. **

We would expect to get a 24x7 Sev1 call-out (configured alerting) and then
log on to the cluster and see what was happening. Our applications only want
a service outage if the node itself has issues, not the cluster.
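
To make "administrator configurable" a bit more concrete, here is a minimal
sketch (plain Python, not actual Heartbeat code; the RECOVERY_POLICY setting
and the daemon commands are made up for illustration) of a watchdog loop over
the key daemons that reacts to a dead core process according to a configured
policy: reboot the node, respawn the daemon, or just raise an alert and leave
the applications alone.

    #!/usr/bin/env python
    # Illustrative sketch only -- not Heartbeat code.  RECOVERY_POLICY and
    # the daemon commands are hypothetical; the point is that the reaction
    # to a dead core process is a configured choice, not hard-wired.
    import os
    import subprocess
    import time

    RECOVERY_POLICY = os.environ.get("RECOVERY_POLICY", "alert")  # reboot | respawn | alert
    KEY_DAEMONS = {
        "ccm":  ["/usr/lib/heartbeat/ccm"],
        "cib":  ["/usr/lib/heartbeat/cib"],
        "crmd": ["/usr/lib/heartbeat/crmd"],
    }

    def alert(msg):
        # Stand-in for a real 24x7 call-out (pager, email, SNMP trap, ...).
        print("ALERT: %s" % msg)

    def recover(name, cmd):
        if RECOVERY_POLICY == "reboot":
            # Fail fast: take the whole node down and let the peers recover.
            subprocess.call(["/sbin/reboot", "-f"])
        elif RECOVERY_POLICY == "respawn":
            alert("%s died, respawning it" % name)
            return subprocess.Popen(cmd)
        else:
            # "alert" policy: applications keep running, somebody gets paged.
            alert("%s died; the cluster is no longer monitoring resources" % name)
            return None

    def main():
        procs = dict((name, subprocess.Popen(cmd)) for name, cmd in KEY_DAEMONS.items())
        while True:
            for name, proc in list(procs.items()):
                if proc is not None and proc.poll() is not None:  # daemon exited
                    procs[name] = recover(name, KEY_DAEMONS[name])
            time.sleep(1)

    if __name__ == "__main__":
        main()

Whether the default here should be "reboot" (the fail-fast argument above) or
"alert" (our preference) is exactly the policy question being discussed; the
sketch is only meant to show that nothing forces the choice to be compiled in.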
