Neil Conway created MESOS-6619:
----------------------------------

             Summary: Duplicate elements in "completed_tasks"
                 Key: MESOS-6619
                 URL: https://issues.apache.org/jira/browse/MESOS-6619
             Project: Mesos
          Issue Type: Bug
          Components: master
            Reporter: Neil Conway
            Assignee: Neil Conway


Scenario:

# A framework starts a non-partition-aware task T on agent A.
# Agent A is partitioned. Task T is marked as a "completed task" in the 
{{Framework}} struct of the master, as part of {{Framework::removeTask}}.
# Agent A re-registers with the master. The tasks running on A are re-added to 
their respective frameworks on the master as running tasks.
# In {{Master::_reregisterSlave}}, the master sends a 
{{ShutdownFrameworkMessage}} for each non-partition-aware framework running on 
the agent. The master then calls {{removeTask}} for each task managed by one of 
these frameworks. This in turn calls {{Framework::removeTask}}, which adds a 
_second_ entry for the same task to {{completed_tasks}}. Because 
{{completed_tasks}} does not detect or suppress duplicates, the collection ends 
up with two elements for task T.

A similar problem occurs when a partition-aware task is running on a 
partitioned agent that re-registers: the same task ends up in the {{tasks}} 
list _and_ in the {{completed_tasks}} list.
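
The two scenarios above can be replayed against a deliberately simplified model 
of the master's per-framework bookkeeping. This is only a sketch: 
{{FrameworkModel}}, its members, and the two replay functions are hypothetical 
names invented for illustration, not the actual Mesos {{Framework}} struct, and 
the real master stores richer task objects rather than bare IDs.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Hypothetical stand-in for the master's per-framework state.
// NOT the real Mesos `Framework` struct.
struct FrameworkModel {
  std::map<std::string, std::string> tasks;  // task ID -> agent ID
  std::vector<std::string> completedTasks;   // task IDs, no de-duplication

  void addTask(const std::string& taskId, const std::string& agentId) {
    tasks[taskId] = agentId;
  }

  // Models the behavior described in this ticket: removing a task
  // always appends it to the completed list, with no duplicate check.
  void removeTask(const std::string& taskId) {
    assert(tasks.count(taskId) == 1);
    tasks.erase(taskId);
    completedTasks.push_back(taskId);
  }
};

// Replays the four steps for a non-partition-aware task T on agent A.
inline FrameworkModel replayNonPartitionAware() {
  FrameworkModel f;
  f.addTask("T", "A");  // 1. framework starts T on A
  f.removeTask("T");    // 2. A is partitioned, T is marked completed
  f.addTask("T", "A");  // 3. A re-registers, T is re-added as running
  f.removeTask("T");    // 4. framework shut down on A, T removed again
  return f;
}

// Replays the partition-aware variant: the task is not shut down after
// the agent re-registers, so step 4 never happens.
inline FrameworkModel replayPartitionAware() {
  FrameworkModel f;
  f.addTask("T", "A");
  f.removeTask("T");
  f.addTask("T", "A");
  return f;
}
```

In this model, {{replayNonPartitionAware()}} leaves two entries for T in 
{{completedTasks}}, while {{replayPartitionAware()}} leaves T in both {{tasks}} 
and {{completedTasks}}.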

Possible fixes/changes:

* Adding a task to the {{completed_tasks}} list when an agent becomes 
partitioned is debatable; certainly for partition-aware tasks, the task is not 
"completed". We might consider adding an "{{unreachable_tasks}}" list to the 
HTTP endpoints.
* Regardless of whether we continue to use {{completed_tasks}} or add a new 
collection, we should ensure the consistency of that data structure after agent 
re-registration.
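
One way the two suggestions could fit together is sketched below, again on a 
simplified model rather than the real master code. It assumes a separate 
unreachable list and assumes that re-adding a task on agent re-registration 
removes any stale unreachable entry. {{FrameworkSketch}}, {{markUnreachable}}, 
and {{unreachableTasks}} are hypothetical names, not existing Mesos APIs.

```cpp
#include <algorithm>
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Hypothetical model of one possible fix. NOT the real Mesos code.
struct FrameworkSketch {
  std::map<std::string, std::string> tasks;   // task ID -> agent ID
  std::vector<std::string> completedTasks;    // terminal tasks only
  std::vector<std::string> unreachableTasks;  // tasks on partitioned agents

  // Called when a task is launched or when its agent re-registers.
  void addTask(const std::string& taskId, const std::string& agentId) {
    // Restore consistency: a running task must not also be listed
    // as unreachable.
    unreachableTasks.erase(
        std::remove(unreachableTasks.begin(), unreachableTasks.end(), taskId),
        unreachableTasks.end());
    tasks[taskId] = agentId;
  }

  // Called when the task's agent becomes partitioned.
  void markUnreachable(const std::string& taskId) {
    if (tasks.erase(taskId) > 0) {
      unreachableTasks.push_back(taskId);
    }
  }

  // Called only when the task is actually removed (e.g. shut down).
  void removeTask(const std::string& taskId) {
    if (tasks.erase(taskId) > 0) {
      completedTasks.push_back(taskId);
    }
  }
};

// Replays the non-partition-aware scenario against the sketch.
inline FrameworkSketch replayWithUnreachableList() {
  FrameworkSketch f;
  f.addTask("T", "A");     // framework starts T on A
  f.markUnreachable("T");  // A is partitioned
  assert(f.tasks.empty());
  f.addTask("T", "A");     // A re-registers
  f.removeTask("T");       // framework is shut down on A
  return f;
}
```

Under these assumptions the replay ends with exactly one entry for T in 
{{completedTasks}} and none in {{unreachableTasks}}. Whether stale entries 
should be dropped on re-add or prevented some other way is a design choice this 
sketch does not settle.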



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
