Hi,

On Mon, Sep 21, 2009 at 12:14:25PM +0200, Andrew Beekhof wrote:
> On Mon, Sep 21, 2009 at 11:39 AM, Dejan Muhamedagic <[email protected]> 
> wrote:
> > Hi,
> >
> > On Mon, Sep 21, 2009 at 11:15:51AM +0200, Andrew Beekhof wrote:
> >> On Fri, Sep 18, 2009 at 12:52 PM, Enno Gröper
> >> <[email protected]> wrote:
> >> > Hi,
> >> > I'm using pacemaker with heartbeat to run a 2 node dhcp server cluster
> >> > with shared disk using drbd for the lease file.
> >> > After upgrading from using heartbeat 2.1.3 (lenny packages) alone (I
> >> > purged the old install and removed the rest of the old files by hand) I
> >> > have some strange problems.
> >> > When stopping the monitored dhcp service using "/etc/init.d/dhcp3-server
> >> > stop" pacemaker recognises this as expected, but instead of simply
> >> > trying to restart the resource on the same node it leaves it stopped
> >> > (the other node is in standby mode).
> >> > To achieve what I want (and what I think was default behaviour using
> >> > heartbeat 2.1.3) I set migration_threshold to 1.
> >> > However failcount is set to INFINITY instead of being increased by 1 so
> >> > this doesn't matter.
> >> > I thought failcount is only set to INFINITY if failures occur when
> >> > starting a resource?
> >>
> >> With migration-threshold = 1, _any_ failure will force the resource to
> >> another node.
> >> Including monitor failures.
> >
> > And if the other node is in standby then the resource remains
> > down. I still find that counterintuitive.
> 
> I don't see why.
> I get that it might not be what you want, but it's a logical consequence of
>   If the resource fails N times on nodeX it can't run on nodeX

OK. Then migration-threshold should have been named max-fail or
something along those lines.

> > To put it differently:
> > How does one configure pacemaker to always fail over to another
> > node, but restart the resource in case no other nodes are
> > available?
> 
> if a small delay is acceptable, then you can use failure-timeout.
> 
> But seriously, if the existing node could still host the resource
> after a single failure, then why force it to move under any condition?
> What benefit do you get from this?
> Basically I'd suggest "1" is the wrong value for migration-threshold
> in this case.  Set it to 2 to see if a restart helps and if not _then_
> force it off (if the other node is down, subsequent restarts are
> unlikely to be helpful in the immediate term).
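For reference, Andrew's suggestion would look roughly like this in crm
shell syntax (a sketch only; the resource definition, operation
interval and timeout values here are illustrative, not taken from
Enno's actual configuration):

```
# hypothetical dhcp resource: allow one restart in place, move after
# the second failure, and forget old failures after two minutes
primitive dhcp lsb:dhcp3-server \
        op monitor interval="30s" \
        meta migration-threshold="2" failure-timeout="120s"
```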

What if there's a transient error condition, for whatever reason,
that makes a resource fail in quick succession on all nodes?
To me it looks like a long failure-timeout in combination with a
small migration-threshold is not a viable configuration.
Permanent errors such as ERR_CONFIGURED or ERR_INSTALLED bump the
failure count to INFINITY, which then prevents the pengine from
ever scheduling a start for that resource. The migration-threshold
basically lowers that limit for transient errors until the
failure-timeout expires. I still think that the migration
(failover threshold) concept should somehow be decoupled from the
maximum failure count.
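For anyone who wants to watch this in practice: the per-resource,
per-node fail count Pacemaker keeps can be inspected and reset by hand
with crm_failcount, which ships with Pacemaker (a sketch with a
placeholder resource name; this needs a running cluster):

```
# show the current fail count for resource "dhcp" on the local node
crm_failcount -G -r dhcp

# reset it so the policy engine will consider starting the resource again
crm_failcount -D -r dhcp
```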

Thanks,

Dejan

_______________________________________________
Linux-HA mailing list
[email protected]
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems
