On 2007-11-06T10:25:05, Alan Robertson <[EMAIL PROTECTED]> wrote:
> For problems that should "never" happen like death of one of our core/key
> processes, is an immediate reboot of the machine the right recovery
> technique?
>
> The advantages of such a choice include:
> It is fast
> It will invoke recovery paths that we exercise a lot in testing
> It is MUCH simpler than trying to recover from all these cases,
> therefore almost certainly more reliable
FailFast / self-fencing is certainly a good default. For selected
processes, we can always do something fancier.
I'd be happy with FailFast for the core processes, provided we get better
recovery for the network-facing processes and possibly stonithd, at least
as long as it executes plugins within its own context.
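To make that split concrete, here is a minimal sketch of a per-process
recovery policy. The Recovery enum, the class names, and the
"net-listener" process are made up for illustration, and the
classification of stonithd is only the "possibly" from above. None of
this is actual Heartbeat configuration or API.

```python
from enum import Enum


class Recovery(Enum):
    FAIL_FAST = "fail_fast"  # self-fence: take the whole node down
    RESTART = "restart"      # respawn only the process that died


# Hypothetical mapping from process class to recovery action.
POLICY = {
    "core": Recovery.FAIL_FAST,
    "restartable": Recovery.RESTART,
}

# Hypothetical classification of processes, for illustration only.
PROCESS_CLASS = {
    "heartbeat": "core",
    "net-listener": "restartable",  # stand-in for a network-facing process
    "stonithd": "restartable",      # only "possibly", see the text above
}


def recovery_for(process_name: str) -> Recovery:
    """Return the recovery action to take when process_name dies.

    Anything not explicitly classified is treated as core, so the
    default stays FailFast.
    """
    process_class = PROCESS_CLASS.get(process_name, "core")
    return POLICY[process_class]
```

The point of the lookup default is that FailFast remains the fallback;
a process only gets gentler treatment once somebody has explicitly
decided it is safe to restart.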
An alternative is to immediately restart all cluster processes locally,
but that can cause fluctuations as well.
My suggestion would be to combine this with the watchdog system to
trigger a reboot, or to simply stop heartbeating and rely on the other
nodes to shoot us.
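Roughly, the watchdog variant could look like the sketch below. It is
only an illustration under assumptions: supervise(), the liveness
callables, and pet_watchdog are hypothetical names, not Heartbeat code.
The idea is that the supervisor keeps feeding the watchdog only while
every core process is alive; once one dies it simply stops, and the
watchdog timeout does the rest.

```python
import time
from typing import Callable, Iterable, Optional


def supervise(
    core_alive: Iterable[Callable[[], bool]],
    pet_watchdog: Callable[[], None],
    interval: float = 1.0,
    max_ticks: Optional[int] = None,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Pet the watchdog while all core processes are alive.

    core_alive   -- hypothetical liveness checks, one per core process
    pet_watchdog -- hypothetical callable that resets the watchdog timer
    max_ticks    -- optional loop bound, mainly useful for testing

    Returns the number of times the watchdog was petted. Returning
    without further petting is the fail-fast step: no cleanup is
    attempted, the watchdog is left to expire.
    """
    checks = list(core_alive)
    ticks = 0
    while max_ticks is None or ticks < max_ticks:
        if not all(check() for check in checks):
            # A core process is gone: stop petting and let the
            # watchdog (or the other nodes) take this node down.
            return ticks
        pet_watchdog()
        ticks += 1
        sleep(interval)
    return ticks
```

On Linux, pet_watchdog would typically be a write to an already opened
watchdog device such as /dev/watchdog; when the writes stop, the timer
expires and the machine resets. The "stop heartbeating" variant has the
same shape, with the heartbeat send taking the place of the watchdog
write.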
> The disadvantages of such a choice include:
> It is crude, and very annoying
It's not very annoying; it means that the machine is beyond repair,
anyway.
> It probably shouldn't be invoked for single-node clusters (?)
It's similar to killing init, which will also reboot the machine. No big
deal.
Regards,
Lars
--
Teamlead Kernel, SuSE Labs, Research and Development
SUSE LINUX Products GmbH, GF: Markus Rex, HRB 16746 (AG Nürnberg)
"Experience is the name everyone gives to their mistakes." -- Oscar Wilde
_______________________________________________
Linux-HA mailing list
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems