As mentioned in #clusterlabs, but I think I'll post it here, so it won't get lost:

pacemaker 1.1.19, in case that matters.

- "all good".
- provoking a resource monitoring failure (manual umount of some file system)
- the monitoring failure triggers a pengine run (input here: no fail-count in the status section yet, but a failed monitoring operation), which results in a "Recovery" transition,
- which is then aborted by the fail-count=1 update of the very failure that this recovery transition is about.
- Meanwhile, the "stop" operation was already scheduled and results in "OK", so the second pengine run now has as input a fail-count=1 and a stopped resource.

The second pengine run would usually come to the same result, minus the already completed actions, and no one would notice. I assume it has been like that for a long time?

But in this case, someone tried to be smart and set a migration-threshold of "very large"; the string in the xml was 999999999999, and that is probably "parsed" into some negative value. This means the fail-count=1 now results in "forcing away ...", different resource placements, and the file system being placed elsewhere now results in many more actions: demoting, role changes, movement of other dependent resources ...

So I think we have two issues here:

a) the fail-count update should be visible as input in the cib before the pengine calculates the recovery transition.

b) migration-threshold (and possibly other scores) should be properly parsed/converted/capped/scaled/rejected.

What do you think? "someone" probably finds the relevant lines of code faster than I do ;-)

Cheers,
Lars

_______________________________________________
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/developers

ClusterLabs home: https://www.clusterlabs.org/
pacemaker 1.1.19, in case that matters. "all good". provoking resource monitoring failure (manually umount of some file system) monitoring failure triggers pengine run, (input here is no fail-count in status section yet, but failed monitoring operation) results in "Recovery" transition. which is then aborted by the fail-count=1 update of the very failure that this recovery transition is about. Meanwhile, the "stop" operation was already scheduled, and results in "OK", so the second pengine run now has as input a fail-count=1, and a stopped resource. The second pengine run would usually come to the same result, minus already completed actions, and no-one would notice. I assume it has been like that for a long time? But in this case, someone tried to be smart and set a migration-threshold of "very large", in this case the string in xml was: 999999999999, and that probably is "parsed" into some negative value, which means the fail-count=1 now results in "forcing away ...", different resource placements, and the file system placement elsewhere now results in much more actions, demoting/role changes/movement of other dependent resources ... So I think we have two issues here: a) I think the fail-count update should be visible as input in the cib before the pengine calculates the recovery transition. b) migration-theshold (and possibly other scores) should be properly parsed/converted/capped/scaled/rejected What do you think? "someone" probably finds the relevant lines of code faster than I do ;-) Cheers, Lars _______________________________________________ Manage your subscription: https://lists.clusterlabs.org/mailman/listinfo/developers ClusterLabs home: https://www.clusterlabs.org/