On Fri, 2019-04-05 at 15:50 +0200, Lars Ellenberg wrote:
> As mentioned in #clusterlabs,
> but I think I'll post it here so it won't get lost:
>
> pacemaker 1.1.19, in case that matters.
>
> "all good".
>
> Provoking a resource monitoring failure
> (manual umount of some file system):
>
> the monitoring failure triggers a pengine run
> (the input here has no fail-count in the status section yet,
> only the failed monitoring operation),
> which results in a "Recovery" transition.
>
> That transition is then aborted by the fail-count=1 update of the very
> failure this recovery transition is about.
>
> Meanwhile, the "stop" operation was already scheduled,
> and results in "OK", so the second pengine run
> now has as input a fail-count=1 and a stopped resource.
>
> The second pengine run would usually come to the same result,
> minus the already completed actions, and no one would notice.
> I assume it has been like that for a long time?
Yep, that's how it's always worked.

> But in this case, someone tried to be smart
> and set a migration-threshold of "very large";
> in this case the string in the XML was 999999999999,
> and that probably is "parsed" into some negative value,

Anything above "INFINITY" (actually 1,000,000) should be mapped to
INFINITY. If that's not what happens, there's a bug. Running
crm_simulate in verbose mode should be helpful.

> which means the fail-count=1 now results in "forcing away ...",
> different resource placements,
> and the file system placement elsewhere now results in many more
> actions: demoting, role changes, movement of other dependent
> resources ...
>
> So I think we have two issues here:
>
> a) I think the fail-count update should be visible as input in the CIB
>    before the pengine calculates the recovery transition.

Agreed, but that's a nontrivial change to the current design.

Currently, once the controller detects a failure, it sends a request to
the attribute manager to update the fail-count attribute, then
immediately sends another request to update the last-failure time. The
attribute manager writes these out to the CIB as it gets them, the CIB
notifies the controller as each write completes, and the controller
calls the scheduler when it receives each notification.

The ideal solution would be to have the controller send a single
request to attrd with both attributes, and have attrd write the values
together. However, the attribute manager's current client/server
protocol and inner workings are designed for one attribute at a time.
Writes can be grouped opportunistically or when dampening is explicitly
configured, but the main approach is to write each change as it is
received.

There would also need to be some special handling for mixed-version
clusters (i.e. keeping it working during a rolling upgrade when some
nodes have the new capability and some don't). It's certainly doable,
but a medium-sized project.
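To make the race concrete, here is a toy sketch of the current vs. the
ideal flow (FakeAttrd/FakeCib are illustrative stand-ins, not Pacemaker
APIs): each CIB write triggers a controller notification, and hence a
scheduler run, so writing fail-count and last-failure as two separate
requests produces two runs, while one batched request would produce one.

```python
class FakeCib:
    """Stand-in for the CIB: counts controller notifications per write."""
    def __init__(self):
        self.data = {}
        self.notifications = 0

    def write(self, attrs):
        self.data.update(attrs)
        # Each completed write notifies the controller once,
        # and each notification triggers a scheduler run.
        self.notifications += 1

class FakeAttrd:
    """Stand-in for the attribute manager: one CIB write per request."""
    def __init__(self, cib):
        self.cib = cib

    def update(self, attrs):
        self.cib.write(attrs)

# Current design: two separate requests -> two scheduler runs,
# the first of which sees the failed op but no fail-count yet.
cib = FakeCib()
attrd = FakeAttrd(cib)
attrd.update({"fail-count-rsc1": "1"})
attrd.update({"last-failure-rsc1": "1554472200"})
print(cib.notifications)  # 2

# Ideal design: one request carrying both attributes -> a single run
# whose input already includes the fail-count.
cib2 = FakeCib()
attrd2 = FakeAttrd(cib2)
attrd2.update({"fail-count-rsc1": "1", "last-failure-rsc1": "1554472200"})
print(cib2.notifications)  # 1
```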
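And on the score-capping point: a minimal sketch of what the parsing is
supposed to do (this only mirrors the intended behavior, it is not
Pacemaker's actual char2score() code) — a value like 999999999999 gets
clamped to INFINITY rather than overflowing into something negative.

```python
# Illustrative sketch only -- not Pacemaker's real score parser.
INFINITY = 1000000  # Pacemaker's internal "INFINITY" score

def parse_score(value):
    """Parse a score string, clamping out-of-range values to +/-INFINITY."""
    if value in ("INFINITY", "+INFINITY"):
        return INFINITY
    if value == "-INFINITY":
        return -INFINITY
    n = int(value)  # assumption: caller handles non-numeric strings
    return max(-INFINITY, min(INFINITY, n))

# The value from the report: capped, not wrapped negative.
print(parse_score("999999999999"))  # 1000000
```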
If there are volunteers, I can give some pointers :) but there's quite
a backlog of projects at the moment.

> b) migration-threshold (and possibly other scores) should be properly
>    parsed/converted/capped/scaled/rejected

That should already be happening.

> What do you think?
>
> "someone" probably finds the relevant lines of code faster than I do
> ;-)
>
> Cheers,
>
> Lars

-- 
Ken Gaillot <kgail...@redhat.com>

_______________________________________________
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/developers

ClusterLabs home: https://www.clusterlabs.org/