Hi,

On Mon, Sep 21, 2009 at 12:14:25PM +0200, Andrew Beekhof wrote:
> On Mon, Sep 21, 2009 at 11:39 AM, Dejan Muhamedagic <[email protected]> wrote:
> > Hi,
> >
> > On Mon, Sep 21, 2009 at 11:15:51AM +0200, Andrew Beekhof wrote:
> >> On Fri, Sep 18, 2009 at 12:52 PM, Enno Gröper
> >> <[email protected]> wrote:
> >> > Hi,
> >> > I'm using pacemaker with heartbeat to run a 2 node dhcp server cluster
> >> > with shared disk using drbd for the lease file.
> >> > After upgrading from using heartbeat 2.1.3 (lenny packages) alone (I
> >> > purged the old install and removed rest of the old files by hand) I have
> >> > some strange problems.
> >> > When stopping the monitored dhcp service using "/etc/init.d/dhcp3-server
> >> > stop" pacemaker recognises this as expected, but instead of simply
> >> > trying to restart the resource on the same node it leaves it stopped
> >> > (the other node is in standby mode).
> >> > To achieve what I want (and what I think was default behaviour using
> >> > heartbeat 2.1.3) I set migration_threshold to 1.
> >> > However failcount is set to INFINITY instead of being increased by 1 so
> >> > this doesn't matter.
> >> > I thought failcount is only set to INFINITY if failures occur on starting
> >> > a resource?
> >>
> >> With migration-threshold = 1, _any_ failure will force the resource to
> >> another node.
> >> Including monitor failures.
> >
> > And if the other node is in standby then the resource remains
> > down. I still find that counterintuitive.
>
> I don't see why.
> I get that it might not be what you want, but it's a logical consequence of
>    If the resource fails N times on nodeX it can't run on nodeX
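To make sure we are talking about the same rule, here is a rough sketch of how I read it. This is only a toy model, not pengine code: the function names (effective_failcount, can_run_here, placement) are made up for illustration, treating a threshold of 0 as "no limit" is an assumption, and so is the idea that failure-timeout is measured from the most recent failure.

```python
import math

# Stand-in for Pacemaker's INFINITY score (an assumption of this sketch).
INFINITY = math.inf


def effective_failcount(failures, now, failure_timeout=None):
    """Sum the recorded failures, or 0 once failure-timeout has expired.

    `failures` is a list of (timestamp, increment) pairs. The timeout is
    assumed to be measured from the most recent failure.
    """
    if not failures:
        return 0
    last_failure = max(timestamp for timestamp, _ in failures)
    if failure_timeout is not None and now - last_failure >= failure_timeout:
        return 0
    return sum(increment for _, increment in failures)


def can_run_here(failcount, migration_threshold):
    """The quoted rule: after N failures on nodeX, no more runs on nodeX."""
    if migration_threshold <= 0:  # assumed to mean "no limit"
        return True
    return failcount < migration_threshold


def placement(failcount, migration_threshold, other_node_available):
    """Decide what happens to the resource after a failure on this node."""
    if can_run_here(failcount, migration_threshold):
        return "restart locally"
    if other_node_available:
        return "move to other node"
    return "stay stopped"
```

With a threshold of 1 and the peer in standby, placement(1, 1, False) gives "stay stopped", which is what Enno sees. With a threshold of 2 the first failure yields a local restart.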
OK. Then migration-threshold should have been named max-fail or
something along that line.

> > To put it differently:
> > How to configure pacemaker to always do a failover to another
> > node, but to restart the resource in case other nodes are not
> > available.
>
> if a small delay is acceptable, then you can use failure-timeout.
>
> But seriously, if the existing node could still host the resource
> after a single failure, then why force it to move under any condition?
> What benefit do you get from this?
> Basically I'd suggest "1" is the wrong value for migration-threshold
> in this case. Set it to 2 to see if a restart helps and if not _then_
> force it off (if the other node is down, subsequent restarts are
> unlikely to be helpful in the immediate term).

What if there's a transient error condition, for whatever reason,
which makes a resource fail in quick succession on all nodes?
To me it looks like a long failure-timeout in combination with
a small migration-threshold is not a viable configuration.

Permanent errors such as ERR_CONFIGURED or ERR_INSTALLED bump the
failure-count to INFINITY, which then prevents pengine from ever
scheduling a start for that resource. The migration-threshold
basically lowers that limit for transient errors until the
failure-timeout expires. I still think that the migration
(failover threshold) concept should be somehow decoupled from
the maximum failure count.

Thanks,

Dejan

_______________________________________________
Linux-HA mailing list
[email protected]
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems
