On 2007-12-19T11:32:12, Andrew Beekhof <[EMAIL PROTECTED]> wrote:

> i prefer to use the "crm respawn" directive which disables the fast-fail 
> logic.
> when a non-transient problem like this occurs and heartbeat is started at 
> boot time (which is the normal thing to do), you have about 2s to identify 
> and fix the problem before the node reboots again
>
> personally, i find this timeframe unrealistic

This is not about "identifying" the problem, but about quickly resolving
transient errors.

If, as in this case, the problem isn't transient, well ...

Fast-fail is the right approach. I'd argue that the saner default might
be to use fast-fail to cause a "crash" (including a crashdump for
debugging) instead of entering a reboot loop, yes.

(Combined with STONITH, the other nodes still might decide to reboot the
node; possibly allowing enough time for it to actually dump would be
saner still.)
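
To make the trade-off concrete, here is a toy model of the three behaviours discussed in this thread. It is only a sketch for illustration, not how heartbeat implements any of this: the names `Policy` and `simulate`, the action strings and the attempt limit are all made up for the example, and the "crash" policy is assumed to simply halt the node after dumping, leaving recovery to an admin or to the rest of the cluster.

```python
from enum import Enum


class Policy(Enum):
    # Hypothetical labels for this sketch; they are not heartbeat settings.
    RESPAWN = "respawn"           # restart the failed child, node stays up
    FAST_FAIL_REBOOT = "reboot"   # reboot the node when a child fails
    FAST_FAIL_CRASH = "crash"     # dump state and halt instead of rebooting


def simulate(policy, transient, max_attempts=5):
    """Return the actions taken for a child that fails on the first attempt.

    A transient fault disappears after one recovery action; a non-transient
    fault is still present on every later attempt.
    """
    actions = []
    for attempt in range(1, max_attempts + 1):
        fault_present = attempt == 1 or not transient
        if not fault_present:
            actions.append("running")
            return actions
        if policy is Policy.FAST_FAIL_CRASH:
            # Stop here: keep the evidence, do not loop.
            actions.append("crashdump+halt")
            return actions
        actions.append("respawn" if policy is Policy.RESPAWN else "reboot")
    # The fault never went away within the attempt limit.
    actions.append("still failing")
    return actions


if __name__ == "__main__":
    for policy in Policy:
        for transient in (True, False):
            print(policy.name, "transient" if transient else "persistent",
                  simulate(policy, transient, max_attempts=3))
```

In this model a transient fault is cleared by a single reboot, while a persistent fault sends the reboot policy around the loop until the attempt limit is hit, and the crash policy stops after the first failure with a dump in hand.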

Fast-fail clearly is the right direction to take, though.



Regards,
    Lars

-- 
Teamlead Kernel, SuSE Labs, Research and Development
SUSE LINUX Products GmbH, GF: Markus Rex, HRB 16746 (AG Nürnberg)
"Experience is the name everyone gives to their mistakes." -- Oscar Wilde

_______________________________________________
Linux-HA mailing list
[email protected]
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems
