On Mon, Sep 21, 2009 at 1:41 PM, Dejan Muhamedagic <[email protected]> wrote:
> On Mon, Sep 21, 2009 at 01:14:05PM +0200, Andrew Beekhof wrote:
>> On Mon, Sep 21, 2009 at 12:56 PM, Dejan Muhamedagic <[email protected]>
>> wrote:
>> > Hi,
>> >
>> > What if there's a transient error condition for whatever reason
>> > which makes a resource fail in quick succession on all nodes.
>> > To me it looks like that long failure-timeout in combination with
>> > small migration-threshold is not a viable configuration.
>>
>> So use a small failure-timeout too.
>> Nothing wrong with that, just make it larger than however long the
>> resource takes to start up.
>>
>> > Permanent errors such as ERR_CONFIGURED or ERR_INSTALLED bump the
>> > failure-count to INFINITY which then prevents pengine from
>> > scheduling start for that resource forever.
>>
>> Yeah, but then we're not talking about transient errors anymore are we.
>>
>> And actually you're totally wrong here.
>>
>> The error code returned has no effect on what fail-count is set to and
>
> How does pengine know not to start a resource on a node if it's
> not properly configured (ERR_CONFIGURED) or prerequisites
> installed (ERR_INSTALLED)? Just by looking at the error code?

Correct

> I thought I saw that fail-count was set to INFINITY on such
> failures.

Not in recent history.

>> the start-failure-is-fatal option will tell the cluster to increment
>> fail-count instead of setting it to infinity.
>> Stop failures will end up with the node being fenced anyway, so it's
>> largely irrelevant what fail-count is set to.
>
> Of course. This was just about monitor and start failures.
>
>> > The
>> > migration-threshold basically lowers that limit for transient
>> > errors until the failure-timeout expires. I still think that the
>> > migration (failover threshold) concept should be somehow
>> > decoupled from the maximum failure count.
>>
>> What would the maximum failure count do then?
>
> That's what we're talking about: the migration-threshold prevents
> the resource from starting on a node. So, it sounds more like
> maximum failure count to me. Anyway, given the circumstances, I
> guess that the failure-timeout should be set depending on
> the migration-threshold, perhaps something like this:
>
>   failure-timeout = n * migration-threshold * max-timeout + retry_wait
>
> where n is the number of nodes, max-timeout is the maximum of
> start and monitor timeouts, and retry_wait is some period for
> which the CRM should wait until trying to start resources again.

That's a pretty good guide.
Not sure I'd want to make that the default though.
_______________________________________________
Linux-HA mailing list
[email protected]
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems
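
For anyone who wants to try the rule of thumb from this thread (failure-timeout = n * migration-threshold * max-timeout + retry_wait), here is a minimal sketch of the arithmetic in Python. The function name, the parameter names, and the sample values are made up for illustration and are not part of any cluster tool. All times are assumed to be in seconds.

```python
def suggested_failure_timeout(nodes, migration_threshold,
                              start_timeout, monitor_timeout, retry_wait):
    """Rule-of-thumb failure-timeout (seconds) as proposed in the thread.

    Hypothetical helper: it only evaluates
        n * migration-threshold * max-timeout + retry_wait
    and does not read or write any cluster configuration.
    """
    if nodes < 1 or migration_threshold < 1:
        raise ValueError("nodes and migration_threshold must be at least 1")
    if min(start_timeout, monitor_timeout, retry_wait) < 0:
        raise ValueError("timeouts and retry_wait must not be negative")

    # max-timeout is the larger of the start and monitor timeouts.
    max_timeout = max(start_timeout, monitor_timeout)
    return nodes * migration_threshold * max_timeout + retry_wait


if __name__ == "__main__":
    # Assumed example values: 2 nodes, migration-threshold of 3,
    # 60 s start timeout, 30 s monitor timeout, 120 s retry wait.
    print(suggested_failure_timeout(2, 3, 60, 30, 120))  # 2*3*60 + 120 = 480
```

With those sample numbers the result is 480 seconds. As noted above, this is a guide for picking a value by hand, not a default the cluster applies.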
