This is an automated email from the ASF dual-hosted git repository.

bmahler pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/mesos.git

commit 543c01ab5c59695795fc6c7e6fc5bc02c3524121
Author: Benjamin Mahler <bmah...@apache.org>
AuthorDate: Mon Apr 15 14:44:36 2024 -0400

    Mitigate a case where the agent gets stuck sending TASK_DROPPED.
    
    Per MESOS-7187, there is a case where the master holds a stale resource
    UUID for the agent's resources, and all subsequent task launches result
    in the agent sending TASK_DROPPED due to "Task assumes outdated resource
    state".
    
    While this patch doesn't fix the general issue of MESOS-7187, it does
    mitigate a known problematic case that was introduced when the agent
    gained its own resource UUID.
---
 src/master/master.cpp | 16 ++++++++++++++++
 1 file changed, 16 insertions(+)

diff --git a/src/master/master.cpp b/src/master/master.cpp
index 4a601c97e..e4f40104a 100644
--- a/src/master/master.cpp
+++ b/src/master/master.cpp
@@ -7746,6 +7746,22 @@ void Master::updateSlave(UpdateSlaveMessage&& message)
     // providers as well.
   }
 
+  // We don't expect the agent's resource version to change, but above we
+  // do have a check to see if it's changed and therefore set `updated`
+  // to true, so we might as well assign the new value here rather than
+  // ignore it.
+  //
+  // Now that we have resource versions in the agent, the lack of the
+  // update to the version here was causing MESOS-7187 to be triggered
+  // when the master receives a re-registration message from the old run
+  // of an agent and then ignores the new re-registration message from a
+  // new run of the agent. When the new agent sends the update message,
+  // the master was seeing the different resource uuid and setting
+  // `updated`, but wasn't actually setting the new version.
+  if (message.has_resource_version_uuid()) {
+    slave->resourceVersion = message.resource_version_uuid();
+  }
+
   ReconcileOperationsMessage reconcile;
 
   // Reconcile operations on agent-default resources.
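
For readers following the scenario in the comment above, here is a minimal
standalone sketch of the stale-version case. It is not Mesos code: the types
`AgentUpdate` and `MasterAgentRecord` and the functions `applyUpdate` and
`agentWouldDrop` are hypothetical stand-ins, and the agent-side check is
assumed to be a plain inequality comparison of version strings.

```cpp
#include <optional>
#include <string>

// Hypothetical stand-in for the update message sent by the agent.
struct AgentUpdate {
  std::optional<std::string> resourceVersionUuid;
};

// Hypothetical stand-in for the master's record of an agent.
struct MasterAgentRecord {
  std::string resourceVersion;
};

// Returns true if the update carries a version that differs from the
// master's record. When `assignVersion` is false this models the behavior
// described as pre-patch: the difference is detected but never stored.
bool applyUpdate(
    MasterAgentRecord& record,
    const AgentUpdate& update,
    bool assignVersion) {
  bool updated = false;

  if (update.resourceVersionUuid.has_value() &&
      *update.resourceVersionUuid != record.resourceVersion) {
    updated = true;

    if (assignVersion) {
      record.resourceVersion = *update.resourceVersionUuid;
    }
  }

  return updated;
}

// Assumed agent-side check: a launch is dropped when the version the
// master attached does not match the agent's current version.
bool agentWouldDrop(
    const std::string& versionFromMaster,
    const std::string& agentVersion) {
  return versionFromMaster != agentVersion;
}
```

With `assignVersion` set to false, the master keeps the old run's version, so
under these assumptions every later launch would be dropped. With it set to
true, the master's record matches the new run and the launch goes through.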
