Re: [Linux-HA] Recovering from "unexpected bad things" - is STONITH the answer?

Andrew Beekhof Tue, 06 Nov 2007 10:26:15 -0800


On Nov 6, 2007, at 6:25 PM, Alan Robertson wrote:

We now have the ComponentFail test in CTS. Thanks Lars for getting it going!
And, in the process, it's showing up some kinds of problems that we hadn't been looking for before. A couple of examples of such problems can be found here:
http://old.linux-foundation.org/developer_bugzilla/show_bug.cgi?id=1762

It is very rare for a stonith action to be actually initiated in this case.
But having stonith disabled results in very dangerous yet unavoidable assumptions being made.


Which is why stonith is so highly encouraged.

http://old.linux-foundation.org/developer_bugzilla/show_bug.cgi?id=1732


The question that comes up is this:

For problems that should "never" happen like death of one of our core/key processes, is an immediate reboot of the machine the right recovery technique?


The advantages of such a choice include:
It is fast
It will invoke recovery paths that we exercise a lot in testing
It is MUCH simpler than trying to recover from all these cases,
        therefore almost certainly more reliable

The disadvantages of such a choice include:
It is crude, and very annoying
It probably shouldn't be invoked for single-node clusters (?)
It could be criticized as being lazy
It shouldn't be invoked if there is another simple and correct method

Continual rebooting becomes a possibility...


Assuming continual re-failure of one of our processes, yes.

We do not have a policy of doing this throughout the project, what we have is a few places where we do it.
I propose that we should consider making a uniform policy decision for the project - and specifically decide to use ungraceful reboots as our recovery method for "key" processes dying (for example: CCM, heartbeat, CIB, CRM). It should work for those cases where people don't configure in watchdogs or explicitly define any STONITH devices, and also independently of quorum policies - because AFAIK it seems like the right choice, there's no technical reason not to do so.

My inclination is to think that this is a good approach to take for problems that in our best-guess judgment "shouldn't happen".

I dislike it for the reason that node suicide provides a false sense of security.
You end up making the window of opportunity for "something bad" to happen smaller, but it still exists.

Personally I'd even favor using the ssh stonith module for cases like 1762 - provided it has the good sense to report failure if it can't complete.
It's certainly no less reliable than suicide and has the benefit of being centrally controlled.

I'm bringing this to both lists, so that we can hear comments both from developers and users.


Comments please...

--
   Alan Robertson <[EMAIL PROTECTED]>
"Openness is the foundation and preservative of friendship... Let me claim from you at all times your undisguised opinions." - William Wilberforce
_______________________________________________
Linux-HA mailing list
[email protected]
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


