Neil Conway created MESOS-6619:
----------------------------------
Summary: Duplicate elements in "completed_tasks"
Key: MESOS-6619
URL: https://issues.apache.org/jira/browse/MESOS-6619
Project: Mesos
Issue Type: Bug
Components: master
Reporter: Neil Conway
Assignee: Neil Conway
Scenario:
# Framework starts non-partition-aware task T on agent A
# Agent A is partitioned. Task T is marked as a "completed task" in the
{{Framework}} struct of the master, as part of {{Framework::removeTask}}.
# Agent A re-registers with the master. The tasks running on A are re-added to
their respective frameworks on the master as running tasks.
# In {{Master::_reregisterSlave}}, the master sends a
{{ShutdownFrameworkMessage}} for each non-partition-aware framework running on
the agent. The master then calls {{removeTask}} for each task managed by one of
these frameworks, which in turn calls {{Framework::removeTask}}, which adds
_another_ copy of the same task to {{completed_tasks}}. Note that
{{completed_tasks}} does not attempt to detect/suppress duplicates, so task T
ends up as two elements in the {{completed_tasks}} collection.
Similar problems occur when a partition-aware task is running on a partitioned
agent that re-registers: the same task then appears in both the {{tasks}} list
_and_ the {{completed_tasks}} list.
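
The scenario above can be replayed against a toy model to show where the duplicate comes from. This is only a sketch: the {{Framework}} struct below is a hypothetical, heavily simplified stand-in for the master's real struct, and {{addTask}}, {{completedTasks}} and {{replayScenario}} are names invented for illustration. The only behavior it assumes from the description is that {{removeTask}} always appends to the completed list without checking for an existing entry.

```cpp
#include <algorithm>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Hypothetical, simplified stand-in for the master's per-framework state.
// This is NOT the real Mesos Framework struct.
struct Framework {
  std::map<std::string, std::string> tasks;  // task ID -> agent ID (running)
  std::vector<std::string> completedTasks;   // task IDs, no duplicate check

  void addTask(const std::string& taskId, const std::string& agentId) {
    tasks[taskId] = agentId;
  }

  // Assumed behavior from the report: removal always appends to the
  // completed list, even if the task ID is already present there.
  void removeTask(const std::string& taskId) {
    tasks.erase(taskId);
    completedTasks.push_back(taskId);
  }
};

// Replays steps 1-4 for a non-partition-aware task and returns how many
// times task "T" appears in the completed list afterwards.
std::size_t replayScenario() {
  Framework framework;
  framework.addTask("T", "A");  // 1. task T starts on agent A
  framework.removeTask("T");    // 2. agent A is partitioned
  framework.addTask("T", "A");  // 3. agent A re-registers, T is re-added
  framework.removeTask("T");    // 4. framework shut down on A, T removed again
  return static_cast<std::size_t>(std::count(
      framework.completedTasks.begin(),
      framework.completedTasks.end(),
      std::string("T")));
}
```

In this model {{replayScenario()}} counts two entries for T, matching the duplicate described in step 4. Skipping the final {{removeTask}} call models the partition-aware case, where T is left in both collections.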
Possible fixes/changes:
* Adding a task to the {{completed_tasks}} list when an agent becomes
partitioned is debatable; certainly for partition-aware tasks, the task is not
"completed". We might consider adding an "{{unreachable_tasks}}" list to the
HTTP endpoints.
* Regardless of whether we continue to use {{completed_tasks}} or add a new
collection, we should ensure the consistency of that data structure after agent
re-registration.
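
One way to keep the collection consistent after re-registration, sketched on a toy model: when a task is re-added as running, drop any stale completed entry for the same task ID. This is an assumption about what a fix could look like, not a description of what Mesos does; {{ConsistentFramework}}, {{readdTask}} and the two helper functions are hypothetical names.

```cpp
#include <algorithm>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Hypothetical, simplified model; not the real Mesos Framework struct.
struct ConsistentFramework {
  std::map<std::string, std::string> tasks;  // task ID -> agent ID (running)
  std::vector<std::string> completedTasks;   // task IDs

  // Re-adding a task first erases any stale completed entry for that ID,
  // so a task is never listed as both running and completed.
  void readdTask(const std::string& taskId, const std::string& agentId) {
    completedTasks.erase(
        std::remove(completedTasks.begin(), completedTasks.end(), taskId),
        completedTasks.end());
    tasks[taskId] = agentId;
  }

  void removeTask(const std::string& taskId) {
    tasks.erase(taskId);
    completedTasks.push_back(taskId);
  }
};

// Non-partition-aware case: partition, re-register, then shutdown.
// Returns the number of completed entries for task "T".
std::size_t completedCountAfterShutdown() {
  ConsistentFramework framework;
  framework.readdTask("T", "A");  // task T starts on agent A
  framework.removeTask("T");      // agent A is partitioned
  framework.readdTask("T", "A");  // agent A re-registers
  framework.removeTask("T");      // framework is shut down on A
  return framework.completedTasks.size();
}

// Partition-aware case: partition, then re-register with no shutdown.
// Returns true if "T" is running and has no completed entry.
bool runningOnlyAfterReregistration() {
  ConsistentFramework framework;
  framework.readdTask("T", "A");
  framework.removeTask("T");
  framework.readdTask("T", "A");
  return framework.tasks.count("T") == 1 && framework.completedTasks.empty();
}
```

The linear scan in {{readdTask}} is acceptable for a bounded list; a separate "{{unreachable_tasks}}" collection, as suggested above, would avoid touching {{completed_tasks}} at partition time altogether.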
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)