Repository: mesos
Updated Branches:
  refs/heads/master 843e5e859 -> 8adb5fcb1

Improved handling of non-terminal operations after master failover.

This patch fixes the handling of non-terminal operations learned by a
newly elected master after a master failover, so that only these
operations are counted as using resources. Previously we did not count
any operations as using resources which by accident produced expected
behavior if the operation was already terminal when the master learned
about them.

We do not address the issue of being unable to properly account for
operations triggered by frameworks unknown to the master, see
MESOS-8582. Instead we emit a warning for now since the master might
continue to abort due to assertion failures due to incomplete resource
accounting.

Review: https://reviews.apache.org/r/65482/


Project: http://git-wip-us.apache.org/repos/asf/mesos/repo
Commit: http://git-wip-us.apache.org/repos/asf/mesos/commit/8adb5fcb
Tree: http://git-wip-us.apache.org/repos/asf/mesos/tree/8adb5fcb
Diff: http://git-wip-us.apache.org/repos/asf/mesos/diff/8adb5fcb

Branch: refs/heads/master
Commit: 8adb5fcb1f6c451bc9ad7ecdc6e39bc170fdcd65
Parents: 843e5e8
Author: Benjamin Bannier <benjamin.bann...@mesosphere.io>
Authored: Mon Mar 12 18:07:24 2018 +0100
Committer: Benjamin Bannier <bbann...@apache.org>
Committed: Mon Mar 12 18:29:07 2018 +0100

----------------------------------------------------------------------
 src/master/master.cpp | 31 +++++++++++++++++++++++++++++++
 1 file changed, 31 insertions(+)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/mesos/blob/8adb5fcb/src/master/master.cpp
----------------------------------------------------------------------
diff --git a/src/master/master.cpp b/src/master/master.cpp
index f0f6e5b..223ebf2 100644
--- a/src/master/master.cpp
+++ b/src/master/master.cpp
@@ -7596,6 +7596,37 @@ void Master::updateSlave(UpdateSlaveMessage&& message)
           }
 
           addOperation(framework, slave, new Operation(operation));
+
+          if (!protobuf::isTerminalState(operation.latest_status().state())) {
+            // If we do not yet know the `FrameworkInfo` of the framework the
+            // operation originated from, we cannot properly track the operation
+            // at this point.
+            //
+            // TODO(bbannier): Consider introducing ways of making
+            // sure an agent always knows the `FrameworkInfo` of
+            // operations triggered on its resources, e.g., by adding
+            // an explicit `FrameworkInfo` to operations like is
+            // already done for `RunTaskMessage`, see MESOS-8582.
+            if (framework == nullptr) {
+              LOG(WARNING)
+                << "Cannot properly account for operation " << operation.uuid()
+                << " learnt in reconciliation of agent " << slaveId
+                << " since framework " << operation.framework_id()
+                << " is unknown; this can lead to assertion failures after the"
+                   " operation terminates, see MESOS-8536";
+              continue;
+            }
+
+            Try<Resources> consumedResources =
+              protobuf::getConsumedResources(operation.info());
+
+            CHECK_SOME(consumedResources)
+              << "Could not determine resources consumed by operation "
+              << operation.uuid();
+
+            usedByOperations[operation.framework_id()] +=
+              consumedResources.get();
+          }
         }
       }
 
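
The accounting rule this patch introduces can be illustrated outside of Mesos with a small standalone sketch. Everything in it is an assumption made for illustration and is not taken from the Mesos code base: the `Operation` struct, the three-value `OperationState` enum, the single `consumedCpus` number standing in for `Resources`, and the `accountOperations` helper are all hypothetical. The set of terminal states in Mesos is larger than shown here. The sketch only mirrors the control flow of the hunk above: terminal operations are skipped, operations of unknown frameworks produce a warning and are skipped, and everything else is added to a per-framework total.

```cpp
#include <cassert>
#include <iostream>
#include <map>
#include <set>
#include <string>
#include <vector>

// Hypothetical, heavily simplified stand-ins for the Mesos types.
enum class OperationState { PENDING, FINISHED, FAILED };

struct Operation {
  std::string uuid;
  std::string frameworkId;
  OperationState state;
  double consumedCpus;  // Stand-in for `Resources`; negative means "unknown".
};

// Simplified analogue of `protobuf::isTerminalState`.
bool isTerminalState(OperationState state) {
  return state == OperationState::FINISHED ||
         state == OperationState::FAILED;
}

// Returns the resources used by operations, keyed by framework ID.
// Only non-terminal operations of known frameworks are counted.
std::map<std::string, double> accountOperations(
    const std::vector<Operation>& operations,
    const std::set<std::string>& knownFrameworks) {
  std::map<std::string, double> usedByOperations;

  for (const Operation& operation : operations) {
    // Terminal operations no longer use resources.
    if (isTerminalState(operation.state)) {
      continue;
    }

    // Without the framework we cannot track the operation; warn and skip.
    if (knownFrameworks.count(operation.frameworkId) == 0) {
      std::cerr << "Cannot properly account for operation "
                << operation.uuid << " since framework "
                << operation.frameworkId << " is unknown" << std::endl;
      continue;
    }

    // Stand-in for the patch's `CHECK_SOME`: abort if the consumed
    // resources could not be determined.
    assert(operation.consumedCpus >= 0.0);

    usedByOperations[operation.frameworkId] += operation.consumedCpus;
  }

  return usedByOperations;
}
```

Given a pending operation using 2 CPUs and a finished one using 4 CPUs from a known framework `fw1`, plus a pending operation from an unknown framework `fw2`, the sketch reports 2 CPUs in use for `fw1`, no entry for `fw2`, and prints one warning. Unlike the real patch, it does not model `addOperation`, which the master still calls for every operation it learns about.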