This is an automated email from the ASF dual-hosted git repository. bmahler pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/mesos.git
commit 543c01ab5c59695795fc6c7e6fc5bc02c3524121
Author: Benjamin Mahler <bmah...@apache.org>
AuthorDate: Mon Apr 15 14:44:36 2024 -0400

    Mitigate a case where the agent gets stuck sending TASK_DROPPED.

    Per MESOS-7187, there is a case where the master holds a stale
    resource UUID for the agent's resources, and all subsequent task
    launches result in the agent sending TASK_DROPPED due to
    "Task assumes outdated resource state".

    While this patch doesn't fix the general issue of MESOS-7187, it
    does mitigate a known problematic case due to the introduction of
    the agent having its own resource UUID.
---
 src/master/master.cpp | 16 ++++++++++++++++
 1 file changed, 16 insertions(+)

diff --git a/src/master/master.cpp b/src/master/master.cpp
index 4a601c97e..e4f40104a 100644
--- a/src/master/master.cpp
+++ b/src/master/master.cpp
@@ -7746,6 +7746,22 @@ void Master::updateSlave(UpdateSlaveMessage&& message)
     // providers as well.
   }
 
+  // We don't expect the agent's resource version to change, but above we
+  // do have a check to see if it's changed and therefore set `updated`
+  // to true, so we might as well assign the new value here rather than
+  // ignore it.
+  //
+  // Now that we have resource versions in the agent, the lack of the
+  // update to the version here was causing MESOS-7187 to be triggered
+  // when the master receives a re-registration message from the old run
+  // of an agent and then ignores the new re-registration message from a
+  // new run of the agent. When the new agent sends the update message,
+  // the master was seeing the different resource uuid and setting
+  // `updated`, but wasn't actually setting the new version.
+  if (message.has_resource_version_uuid()) {
+    slave->resourceVersion = message.resource_version_uuid();
+  }
+
   ReconcileOperationsMessage reconcile;
 
   // Reconcile operations on agent-default resources.
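
The failure mode described in the patch comment can be illustrated with a small standalone model. This is only a sketch under stated assumptions: `AgentRecord`, `UpdateMessage`, `applyUpdate`, `launchIsDropped`, and the `storeVersion` switch are hypothetical names made up for illustration and are not part of the Mesos code base. The sketch contrasts detecting a changed resource version without storing it (the behavior before the patch) with detecting and storing it (the behavior after the patch).

```cpp
#include <cassert>
#include <optional>
#include <string>

// Hypothetical stand-in for the master's bookkeeping about one agent.
struct AgentRecord {
  std::string resourceVersion;
};

// Hypothetical stand-in for the agent's update message.
struct UpdateMessage {
  std::optional<std::string> resourceVersionUuid;
};

// Returns true if the message carried a version that differs from the
// stored one (the `updated` flag in the patch comment). When
// `storeVersion` is false only the flag is set, as before the patch;
// when true the new version is also stored, as after the patch.
bool applyUpdate(
    AgentRecord& record, const UpdateMessage& message, bool storeVersion)
{
  bool updated = false;

  if (message.resourceVersionUuid.has_value() &&
      *message.resourceVersionUuid != record.resourceVersion) {
    updated = true;
  }

  if (storeVersion && message.resourceVersionUuid.has_value()) {
    record.resourceVersion = *message.resourceVersionUuid;
  }

  return updated;
}

// Simulates the scenario: the master holds the version from the old run
// of the agent, then receives an update message from the new run.
// Returns true if a later task launch would carry a stale version, which
// in this model is what makes the agent answer with TASK_DROPPED.
bool launchIsDropped(bool storeVersion)
{
  const std::string newRunVersion = "uuid-new-run";

  AgentRecord record{"uuid-old-run"};
  UpdateMessage message{newRunVersion};

  const bool updated = applyUpdate(record, message, storeVersion);
  assert(updated); // The change is detected in both variants.
  (void)updated;

  // The agent compares the version sent by the master with its own.
  return record.resourceVersion != newRunVersion;
}
```

In this model, `launchIsDropped(false)` reports a dropped launch because the master keeps sending the old version, while `launchIsDropped(true)` does not, which mirrors the mitigation the patch applies for this one case.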