Repository: mesos
Updated Branches:
  refs/heads/master 843e5e859 -> 8adb5fcb1


Improved handling of non-terminal operations after master failover.

This patch fixes the handling of non-terminal operations that a newly
elected master learns about after a master failover, so that only
these operations are counted as using resources. Previously we did not
count any operations as using resources, which by accident produced
the expected behavior if an operation was already terminal when the
master learned about it.

We do not address the issue of being unable to properly account for
operations triggered by frameworks unknown to the master, see
MESOS-8582. Instead we emit a warning for now, since the master might
still abort on assertion failures caused by incomplete resource
accounting.
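
For illustration, the accounting rule described above can be sketched
in isolation. This is a simplified model, not the code in master.cpp:
the `Operation` struct, the `OperationState` enum, the single scalar
`consumedCpus` resource, and the `accountOperations` helper are all
hypothetical stand-ins for the real Mesos types.

```cpp
#include <cassert>
#include <iostream>
#include <map>
#include <set>
#include <string>
#include <vector>

// Hypothetical stand-ins for the Mesos types; not the real definitions.
enum class OperationState { PENDING, FINISHED, FAILED };

struct Operation {
  std::string uuid;
  std::string frameworkId;
  OperationState state;
  double consumedCpus; // Simplification: one scalar instead of `Resources`.
};

bool isTerminal(OperationState state) {
  return state == OperationState::FINISHED ||
         state == OperationState::FAILED;
}

// Returns the resources used by operations, keyed by framework ID.
// Only non-terminal operations of known frameworks are counted.
std::map<std::string, double> accountOperations(
    const std::vector<Operation>& operations,
    const std::set<std::string>& knownFrameworks) {
  std::map<std::string, double> usedByOperations;

  for (const Operation& operation : operations) {
    // Terminal operations no longer use resources.
    if (isTerminal(operation.state)) {
      continue;
    }

    // Operations of unknown frameworks cannot be accounted for;
    // warn and skip them.
    if (knownFrameworks.count(operation.frameworkId) == 0) {
      std::cerr << "WARNING: cannot account for operation "
                << operation.uuid << " since framework "
                << operation.frameworkId << " is unknown" << std::endl;
      continue;
    }

    // Illustrative sanity check on the simplified resource amount.
    assert(operation.consumedCpus >= 0.0);

    usedByOperations[operation.frameworkId] += operation.consumedCpus;
  }

  return usedByOperations;
}
```

In this sketch a pending operation of a known framework adds to that
framework's usage, while terminal operations and operations of unknown
frameworks contribute nothing.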

Review: https://reviews.apache.org/r/65482/


Project: http://git-wip-us.apache.org/repos/asf/mesos/repo
Commit: http://git-wip-us.apache.org/repos/asf/mesos/commit/8adb5fcb
Tree: http://git-wip-us.apache.org/repos/asf/mesos/tree/8adb5fcb
Diff: http://git-wip-us.apache.org/repos/asf/mesos/diff/8adb5fcb

Branch: refs/heads/master
Commit: 8adb5fcb1f6c451bc9ad7ecdc6e39bc170fdcd65
Parents: 843e5e8
Author: Benjamin Bannier <benjamin.bann...@mesosphere.io>
Authored: Mon Mar 12 18:07:24 2018 +0100
Committer: Benjamin Bannier <bbann...@apache.org>
Committed: Mon Mar 12 18:29:07 2018 +0100

----------------------------------------------------------------------
 src/master/master.cpp | 31 +++++++++++++++++++++++++++++++
 1 file changed, 31 insertions(+)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/mesos/blob/8adb5fcb/src/master/master.cpp
----------------------------------------------------------------------
diff --git a/src/master/master.cpp b/src/master/master.cpp
index f0f6e5b..223ebf2 100644
--- a/src/master/master.cpp
+++ b/src/master/master.cpp
@@ -7596,6 +7596,37 @@ void Master::updateSlave(UpdateSlaveMessage&& message)
           }
 
           addOperation(framework, slave, new Operation(operation));
+
+          if (!protobuf::isTerminalState(operation.latest_status().state())) {
+            // If we do not yet know the `FrameworkInfo` of the framework the
+            // operation originated from, we cannot properly track the operation
+            // at this point.
+            //
+            // TODO(bbannier): Consider introducing ways of making
+            // sure an agent always knows the `FrameworkInfo` of
+            // operations triggered on its resources, e.g., by adding
+            // an explicit `FrameworkInfo` to operations like is
+            // already done for `RunTaskMessage`, see MESOS-8582.
+            if (framework == nullptr) {
+              LOG(WARNING)
+                << "Cannot properly account for operation " << operation.uuid()
+                << " learnt in reconciliation of agent " << slaveId
+                << " since framework " << operation.framework_id()
+                << " is unknown; this can lead to assertion failures after the"
+                   " operation terminates, see MESOS-8536";
+              continue;
+            }
+
+            Try<Resources> consumedResources =
+              protobuf::getConsumedResources(operation.info());
+
+            CHECK_SOME(consumedResources)
+              << "Could not determine resources consumed by operation "
+              << operation.uuid();
+
+            usedByOperations[operation.framework_id()] +=
+              consumedResources.get();
+          }
         }
       }
 
