On Mon, Sep 21, 2009 at 1:41 PM, Dejan Muhamedagic <[email protected]> wrote:
> On Mon, Sep 21, 2009 at 01:14:05PM +0200, Andrew Beekhof wrote:
>> On Mon, Sep 21, 2009 at 12:56 PM, Dejan Muhamedagic <[email protected]> wrote:
>> > Hi,
>> >
>> > What if there's a transient error condition, for whatever reason,
>> > which makes a resource fail in quick succession on all nodes?
>> > To me it looks like a long failure-timeout in combination with
>> > a small migration-threshold is not a viable configuration.
>>
>> So use a small failure-timeout too.
>> Nothing wrong with that; just make it larger than however long the
>> resource takes to start up.
>>
>> > Permanent errors such as ERR_CONFIGURED or ERR_INSTALLED bump the
>> > failure-count to INFINITY which then prevents pengine from
>> > scheduling start for that resource forever.
>>
>> Yeah, but then we're not talking about transient errors anymore, are we?
>>
>> And actually you're totally wrong here.
>>
>> The error code returned has no effect on what fail-count is set to and
>
> How does pengine know not to start a resource on a node if it's
> not properly configured (ERR_CONFIGURED) or its prerequisites are
> not installed (ERR_INSTALLED)? Just by looking at the error code?

Correct

> I thought I saw that fail-count was set to INFINITY on such
> failures.

Not in recent history.

>> the start-failure-is-fatal option will tell the cluster to increment
>> fail-count instead of setting it to infinity.
>> Stop failures will end up with the node being fenced anyway, so it's
>> largely irrelevant what fail-count is set to.
>
> Of course. This was just about monitor and start failures.
>
>> > The
>> > migration-threshold basically lowers that limit for transient
>> > errors until the failure-timeout expires. I still think that the
>> > migration (failover threshold) concept should be somehow
>> > decoupled from the maximum failure count.
>>
>> What would the maximum failure count do then?
>
> That's what we're talking about: the migration-threshold prevents
> the resource from starting on a node. So, it sounds more like
> maximum failure count to me. Anyway, given the circumstances, I
> guess that the failure-timeout should be set depending on
> the migration-threshold, perhaps something like this:
>
>        failure-timeout = n * migration-threshold * max-timeout + retry_wait
>
> where n is the number of nodes, max-timeout is the maximum of
> start and monitor timeouts, and retry_wait is some period for
> which the CRM should wait until trying to start resources again.

That's a pretty good guide. Not sure I'd want to make that the default though.
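
For illustration, a minimal sketch of that rule of thumb in Python. Only the
formula itself comes from the thread; the function name, the use of seconds as
the unit, and the example values (3 nodes, migration-threshold of 2, 60s start
timeout, 30s monitor timeout, 120s retry_wait) are assumptions made up for the
example, not recommended settings.

```python
def suggested_failure_timeout(nodes, migration_threshold,
                              start_timeout, monitor_timeout, retry_wait):
    """Rule-of-thumb failure-timeout, in the same unit as the inputs.

    Implements: n * migration-threshold * max-timeout + retry_wait,
    where max-timeout is the larger of the start and monitor timeouts.
    This helper is hypothetical; it is not part of any cluster tool.
    """
    if nodes < 1 or migration_threshold < 1:
        raise ValueError("nodes and migration_threshold must be >= 1")
    if min(start_timeout, monitor_timeout, retry_wait) < 0:
        raise ValueError("timeouts and retry_wait must be non-negative")

    # The slowest of the two operations bounds how long one failure can take.
    max_timeout = max(start_timeout, monitor_timeout)

    # Worst case: the resource uses up its threshold on every node in turn,
    # then the cluster waits retry_wait before trying again.
    return nodes * migration_threshold * max_timeout + retry_wait


if __name__ == "__main__":
    # Assumed example values, all in seconds.
    timeout = suggested_failure_timeout(
        nodes=3, migration_threshold=2,
        start_timeout=60, monitor_timeout=30, retry_wait=120)
    print(f"failure-timeout = {timeout}s")  # 3 * 2 * 60 + 120 = 480
```

With those assumed numbers the result is 480 seconds, i.e. long enough for the
resource to hit the threshold on every node once before any fail-count expires.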
_______________________________________________
Linux-HA mailing list
[email protected]
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems
