On Fri, Apr 05, 2019 at 09:56:51AM -0500, Ken Gaillot wrote:
> On Fri, 2019-04-05 at 09:44 -0500, Ken Gaillot wrote:
> > On Fri, 2019-04-05 at 15:50 +0200, Lars Ellenberg wrote:
> > > As mentioned in #clusterlabs,
> > > but I think I post it here, so it won't get lost:
> > >
> > > pacemaker 1.1.19, in case that matters.
> > >
> > > "all good".
> > >
> > > provoking resource monitoring failure
> > > (manual umount of some file system)
> > >
> > > monitoring failure triggers pengine run
> > > (input here is no fail-count in the status section yet,
> > > but a failed monitoring operation),
> > > results in "Recovery" transition.
>
> Ah, I misunderstood ... I was thinking the two writes were the fail
> count and the last failure time node attributes, but that's not it
> (and probably already handled sufficiently).
>
> The two writes are the recording of the operation status result,
> followed by the recording of the fail count and last failure time
> node attributes.
>
> That's more challenging. The controller needs to see the status
> result to determine whether it was a failure or not, at which point
> it sends the fail count change to the attribute manager, which writes
> it to the CIB, and the controller can only act on it then.
>
> A possible solution would be to skip running the scheduler for the
> status result if we're sending a fail count change. But that's risky
> because it assumes everything involved in the fail count change will
> succeed, and in a reasonable time. That's why the current approach is
> to run the scheduler immediately and then re-run when the fail-count
> change is confirmed.
I see. So there is that. We can live with that, I guess; it is very
unlikely to cause issues "in the real world".

> > > which is then aborted by the fail-count=1 update of the very
> > > failure that this recovery transition is about.
> > >
> > > Meanwhile, the "stop" operation was already scheduled,
> > > and results in "OK", so the second pengine run
> > > now has as input a fail-count=1, and a stopped resource.
> > >
> > > The second pengine run would usually come to the same result,
> > > minus already completed actions, and no-one would notice.
> > > I assume it has been like that for a long time?
> >
> > Yep, that's how it's always worked.
> >
> > > But in this case, someone tried to be smart
> > > and set a migration-threshold of "very large",
> > > in this case the string in the xml was: 999999999999,
> > > and that probably is "parsed" into some negative value,
> >
> > Anything above "INFINITY" (actually 1,000,000) should be mapped to
> > INFINITY. If that's not what happens, there's a bug. Running
> > crm_simulate in verbose mode should be helpful.

I think I found it already.

char2score() does crm_parse_int(), and reasonably assumes that the
result is the parsed int. Which it is not, if the result is -1 and
errno is set to EINVAL or ERANGE ;-)

char2score -> crm_parse_int "999999999999"
  -> result of strtoll is > INT_MAX, returns -1, errno ERANGE
  -> migration_threshold = -1;

Not sure what to do there, though. Yet another helper, mapping ERANGE
to the appropriate MIN/MAX for the conversion? But any "sane"
configuration would not even trigger that. Where and how would we
point out the "in-sane-ness" to the user, though?

> > > which means the fail-count=1 now results in "forcing away ...",
> > > different resource placements,
> > > and the file system placement elsewhere now results in many more
> > > actions, demoting/role changes/movement of other dependent
> > > resources ...
> > >
> > > So I think we have two issues here:
> > >
> > > a) I think the fail-count update should be visible as input in
> > > the cib before the pengine calculates the recovery transition.
> >
> > Agreed, but that's a nontrivial change to the current design.
> >
> > Currently, once the controller detects a failure, it sends a
> > request to the attribute manager to update the fail-count
> > attribute, then immediately sends another request to update the
> > last-failure time. The attribute manager writes these out to the
> > CIB as it gets them, the CIB notifies the controller as each write
> > completes, and the controller calls the scheduler when it receives
> > each notification.
> >
> > The ideal solution would be to have the controller send a single
> > request to the attrd with both attributes, and have attrd write
> > the values together. However, the attribute manager's current
> > client/server protocol and inner workings are designed for one
> > attribute at a time. Writes can be grouped opportunistically or
> > when a dampening is explicitly configured, but the main approach
> > is to write each change as it is received.
> >
> > There would also need to be some special handling for
> > mixed-version clusters (i.e. keeping it working during a rolling
> > upgrade when some nodes have the new capability and some don't).
> >
> > It's certainly doable, but a medium-sized project. If there are
> > volunteers, I can give some pointers :) but there's quite a
> > backlog of projects at the moment.
> > > b) migration-threshold (and possibly other scores) should be
> > > properly parsed/converted/capped/scaled/rejected
> >
> > That should already be happening

-- 
: Lars Ellenberg
: LINBIT | Keeping the Digital World Running
: DRBD -- Heartbeat -- Corosync -- Pacemaker
: R&D, Integration, Ops, Consulting, Support

DRBD® and LINBIT® are registered trademarks of LINBIT

_______________________________________________
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/developers

ClusterLabs home: https://www.clusterlabs.org/