This is an automated email from the ASF dual-hosted git repository.
bbannier pushed a commit to branch 1.9.x
in repository https://gitbox.apache.org/repos/asf/mesos.git
The following commit(s) were added to refs/heads/1.9.x by this push:
new c313168 Garbage-collected lost tasks which are reported as running
again.
c313168 is described below
commit c31316814398990abf1013bb0681a907426a4fec
Author: Benjamin Bannier <[email protected]>
AuthorDate: Fri Nov 1 13:08:35 2019 +0100
Garbage-collected lost tasks which are reported as running again.
Under certain conditions tasks which were previously `TASK_LOST` and
completed can reappear in non-terminal states, e.g., if the agent on
which they were running reconnects.
This patch adds garbage collection of such completed tasks so that users
do not see tasks twice when obtaining task information from the master
API. This change does not affect task status updates, where we already
correctly reported a previously `TASK_LOST` state as superseded by, e.g.,
`TASK_RUNNING`.
Review: https://reviews.apache.org/r/71641/
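
As a rough illustration of the cleanup described above, here is a minimal,
self-contained sketch of the erase-remove idiom the patch relies on. It does
not use the Mesos code: `Task`, `CompletedTasks`, and `dropCompleted` are
hypothetical stand-ins, and `std::unique_ptr` takes the place of `Owned<Task>`.

```cpp
#include <algorithm>
#include <cstddef>
#include <memory>
#include <string>
#include <vector>

// Hypothetical stand-in for the Mesos `Task` message; only the ID matters.
struct Task {
  std::string id;
  std::string state;
};

// Hypothetical stand-in for a framework's history of completed tasks.
using CompletedTasks = std::vector<std::unique_ptr<Task>>;

// Removes every completed entry whose ID equals `taskId` and returns the
// number of entries dropped. Null entries are skipped, mirroring the null
// check in the patch.
std::size_t dropCompleted(CompletedTasks& completed, const std::string& taskId)
{
  // `std::remove_if` moves the entries to keep to the front (preserving
  // their order) and returns the start of the leftover tail.
  const auto firstRemoved = std::remove_if(
      completed.begin(),
      completed.end(),
      [&](const std::unique_ptr<Task>& task) {
        return task != nullptr && task->id == taskId;
      });

  const std::size_t dropped =
    static_cast<std::size_t>(completed.end() - firstRemoved);

  // Erasing the tail actually shrinks the container.
  completed.erase(firstRemoved, completed.end());
  return dropped;
}
```

Calling `dropCompleted(completed, "t1")` on a history that holds a `TASK_LOST`
entry for `t1` leaves only the other entries behind, so a task that comes
back as running is no longer listed as completed as well. A second call with
the same ID finds nothing to remove and returns 0.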
---
src/master/master.cpp | 18 ++++++++++++++++++
1 file changed, 18 insertions(+)
diff --git a/src/master/master.cpp b/src/master/master.cpp
index 933fc89..73507ce 100644
--- a/src/master/master.cpp
+++ b/src/master/master.cpp
@@ -7848,6 +7848,24 @@ void Master::__reregisterSlave(
Framework* framework = getFramework(frameworkId);
if (framework != nullptr) {
framework->unreachableTasks.erase(task.task_id());
+
+ // The master transitions task to terminal state on its own in certain
+ // scenarios (e.g., framework or agent teardown) before instructing the
+ // agent to remove it. However, we are not guaranteed that the message
+ // reaches the agent and is processed by it. If the agent fails to act
+ // on the message, tasks the master has declared terminal might reappear
+ // from the agent as non-terminal, see e.g., MESOS-9940.
+ //
+ // Avoid tracking a task as both terminal and non-terminal by
+ // garbage-collecting completed tasks which come back as running.
+ framework->completedTasks.erase(
+ std::remove_if(
+ framework->completedTasks.begin(),
+ framework->completedTasks.end(),
+ [&](const Owned<Task>& task_) {
+ return task_.get() && task_->task_id() == task.task_id();
+ }),
+ framework->completedTasks.end());
}
const string message = slaves.unreachable.contains(slaveInfo.id())