On Mon, Sep 21, 2009 at 12:56 PM, Dejan Muhamedagic <[email protected]> wrote:
> Hi,
>
> On Mon, Sep 21, 2009 at 12:14:25PM +0200, Andrew Beekhof wrote:
>> On Mon, Sep 21, 2009 at 11:39 AM, Dejan Muhamedagic <[email protected]>
>> wrote:
>> > Hi,
>> >
>> > On Mon, Sep 21, 2009 at 11:15:51AM +0200, Andrew Beekhof wrote:
>> >> On Fri, Sep 18, 2009 at 12:52 PM, Enno Gröper
>> >> <[email protected]> wrote:
>> >> > Hi,
>> >> > I'm using pacemaker with heartbeat to run a 2-node DHCP server
>> >> > cluster, with a shared disk using DRBD for the lease file.
>> >> > After upgrading from heartbeat 2.1.3 (lenny packages) alone (I
>> >> > purged the old install and removed the rest of the old files by
>> >> > hand) I have some strange problems.
>> >> > When I stop the monitored DHCP service using "/etc/init.d/dhcp3-server
>> >> > stop", pacemaker recognises this as expected, but instead of simply
>> >> > trying to restart the resource on the same node it leaves it stopped
>> >> > (the other node is in standby mode).
>> >> > To achieve what I want (and what I think was the default behaviour
>> >> > with heartbeat 2.1.3) I set migration-threshold to 1.
>> >> > However, the failcount is set to INFINITY instead of being increased
>> >> > by 1, so this doesn't matter.
>> >> > I thought failcount was only set to INFINITY if failures occur on
>> >> > starting a resource?
>> >>
>> >> With migration-threshold = 1, _any_ failure will force the resource to
>> >> another node.
>> >> Including monitor failures.
>> >
>> > And if the other node is in standby then the resource remains
>> > down. I still find that counterintuitive.
>>
>> I don't see why.
>> I get that it might not be what you want, but it's a logical consequence of:
>> if the resource fails N times on nodeX, it can't run on nodeX.
>
> OK. Then migration-threshold should have been named max-fail or
> something along that line.
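[For readers following along: a minimal crm shell sketch of the kind of configuration under discussion. The resource name and operation values are illustrative, not taken from Enno's actual setup.]

```
# Illustrative only: a primitive with migration-threshold=1.
# With this setting, a single failure of any kind -- including a
# monitor failure -- pushes the fail-count past the threshold, so
# the resource is banned from the node it failed on. If the only
# other node is in standby, there is nowhere left to run it and
# it simply stays stopped, which is the behaviour described above.
primitive dhcpd lsb:dhcp3-server \
        op monitor interval="30s" \
        meta migration-threshold="1"
```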
>
>> > To put it differently:
>> > How do you configure pacemaker to always fail over to another
>> > node, but to restart the resource in case other nodes are not
>> > available?
>>
>> If a small delay is acceptable, then you can use failure-timeout.
>>
>> But seriously, if the existing node could still host the resource
>> after a single failure, then why force it to move under any condition?
>> What benefit do you get from this?
>> Basically I'd suggest "1" is the wrong value for migration-threshold
>> in this case. Set it to 2 to see if a restart helps, and if not _then_
>> force it off (if the other node is down, subsequent restarts are
>> unlikely to be helpful in the immediate term).
>
> What if there's a transient error condition, for whatever reason,
> which makes a resource fail in quick succession on all nodes?
> To me it looks like a long failure-timeout in combination with a
> small migration-threshold is not a viable configuration.
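[A sketch of Andrew's suggestion in crm shell syntax. The timeout value is an arbitrary example; pick something longer than the resource's startup time, as discussed below.]

```
# Illustrative only: allow one local restart before moving the
# resource (migration-threshold=2), and expire old failures after
# a while (failure-timeout) so that a single transient failure
# does not count against the node forever.
primitive dhcpd lsb:dhcp3-server \
        op monitor interval="30s" \
        meta migration-threshold="2" failure-timeout="120s"
```

With this, the first failure triggers a restart in place; only a second failure within the failure-timeout window forces the resource off the node.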
So use a small failure-timeout too. Nothing wrong with that; just make it
larger than however long the resource takes to start up.

> Permanent errors such as ERR_CONFIGURED or ERR_INSTALLED bump the
> failure-count to INFINITY, which then prevents pengine from
> scheduling a start for that resource forever.

Yeah, but then we're not talking about transient errors anymore, are we?

And actually you're totally wrong here. The error code returned has no
effect on what fail-count is set to, and the start-failure-is-fatal
option will tell the cluster to increment fail-count instead of setting
it to INFINITY. Stop failures will end up with the node being fenced
anyway, so it's largely irrelevant what fail-count is set to.

> The migration-threshold basically lowers that limit for transient
> errors until the failure-timeout expires. I still think that the
> migration (failover threshold) concept should be somehow
> decoupled from the maximum failure count.

What would the maximum failure count do then?

_______________________________________________
Linux-HA mailing list
[email protected]
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems
