[
https://issues.apache.org/jira/browse/MESOS-9940?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Benjamin Bannier reassigned MESOS-9940:
---------------------------------------
Assignee: (was: Benjamin Bannier)
> Framework removal may lead to inconsistent task states between master and
> agent.
> --------------------------------------------------------------------------------
>
> Key: MESOS-9940
> URL: https://issues.apache.org/jira/browse/MESOS-9940
> Project: Mesos
> Issue Type: Bug
> Components: master
> Reporter: Meng Zhu
> Priority: Major
> Labels: foundations
>
> When a framework is removed from the master (say, due to disconnection), the
> master sends a `ShutdownFrameworkMessage` to the agent. At the same time, the
> master transitions the tasks' status to, e.g., KILLED.
> (https://github.com/apache/mesos/blob/master/src/master/master.cpp#L11247-L11291)
> When the agent receives the shutdown message, it tries to shut down all the
> executors and destroy all the containers. The tasks' status is updated only
> after all of this is done.
> (https://github.com/apache/mesos/blob/master/src/slave/slave.cpp#L7914-L7922)
> However, if the executor shutdown gets stuck (e.g. due to a hanging Docker
> daemon), the task status transition never happens on the agent, and the
> master and agent end up with diverged views of these tasks.
> One consequence is that the master may try to schedule more workloads onto
> the problematic agent (because it thinks those tasks' resources are freed up).
> Since we do not have an overcommit check on the agent, the agent will comply
> and launch those tasks. This will lead to over-allocation.
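To make the missing overcommit check concrete, here is a minimal sketch of what
an agent-side guard could look like. Everything in it is an assumption made for
illustration: `Res`, `AgentLedger`, `tryLaunch` and `onTerminal` are
hypothetical names, not Mesos APIs, and resources are reduced to CPUs and
memory. The idea is that the agent keeps a task's resources charged until it
has itself observed the terminal transition, and refuses launches that would
exceed its total.

```cpp
#include <cassert>
#include <string>
#include <unordered_map>

// Hypothetical, simplified resource vector (real Mesos resources are richer).
struct Res {
  double cpus;
  double memMb;
};

// Hypothetical agent-side ledger. A task's resources stay charged until the
// agent itself sees the task reach a terminal state.
class AgentLedger {
public:
  explicit AgentLedger(Res total) : total_(total), used_{0.0, 0.0} {
    assert(total.cpus >= 0.0 && total.memMb >= 0.0);
  }

  // Returns false (and launches nothing) if the task would overcommit.
  bool tryLaunch(const std::string& taskId, Res r) {
    if (tasks_.count(taskId) > 0) {
      return false;  // Duplicate task ID.
    }
    if (used_.cpus + r.cpus > total_.cpus ||
        used_.memMb + r.memMb > total_.memMb) {
      return false;  // Would exceed what the agent actually has.
    }
    used_.cpus += r.cpus;
    used_.memMb += r.memMb;
    tasks_.emplace(taskId, r);
    return true;
  }

  // To be called only once the executor and its containers are really gone.
  void onTerminal(const std::string& taskId) {
    auto it = tasks_.find(taskId);
    if (it == tasks_.end()) {
      return;
    }
    used_.cpus -= it->second.cpus;
    used_.memMb -= it->second.memMb;
    tasks_.erase(it);
  }

  Res used() const { return used_; }

private:
  Res total_;
  Res used_;
  std::unordered_map<std::string, Res> tasks_;
};
```

With such a guard, a launch sent while the old executor is still stuck in
shutdown would be rejected by the agent rather than silently over-allocating it.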
> One possible solution is to hold off on the master status update until the
> agent is done with the framework shutdown.
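The proposed fix can be sketched as a small state machine on the master side.
This is only an illustration under stated assumptions: `MasterView`, the
`PENDING_SHUTDOWN` state and the `onShutdownComplete` acknowledgement are
hypothetical and do not correspond to existing Mesos code or messages. On
framework removal the master marks the tasks as pending and keeps their
resources allocated; it marks them KILLED and frees the resources only when the
agent reports that the shutdown has finished.

```cpp
#include <cassert>
#include <map>
#include <string>

// PENDING_SHUTDOWN is a hypothetical intermediate state for this sketch.
enum class TaskState { RUNNING, PENDING_SHUTDOWN, KILLED };

// Hypothetical master-side view of a single agent, tracking CPUs only.
class MasterView {
public:
  explicit MasterView(double totalCpus) : totalCpus_(totalCpus) {}

  void addTask(const std::string& frameworkId,
               const std::string& taskId,
               double cpus) {
    assert(cpus >= 0.0);
    tasks_[taskId] = Task{frameworkId, cpus, TaskState::RUNNING};
  }

  // Framework removal: the shutdown message would be sent here (not modeled).
  // Tasks are only marked pending; their resources remain allocated.
  void removeFramework(const std::string& frameworkId) {
    for (auto& entry : tasks_) {
      Task& task = entry.second;
      if (task.frameworkId == frameworkId &&
          task.state == TaskState::RUNNING) {
        task.state = TaskState::PENDING_SHUTDOWN;
      }
    }
  }

  // Hypothetical acknowledgement: the agent finished shutting down the
  // framework's executors, so the tasks can now become terminal.
  void onShutdownComplete(const std::string& frameworkId) {
    for (auto& entry : tasks_) {
      Task& task = entry.second;
      if (task.frameworkId == frameworkId &&
          task.state == TaskState::PENDING_SHUTDOWN) {
        task.state = TaskState::KILLED;
      }
    }
  }

  // CPUs the master may offer again; non-terminal tasks still count as used.
  double offerableCpus() const {
    double used = 0.0;
    for (const auto& entry : tasks_) {
      if (entry.second.state != TaskState::KILLED) {
        used += entry.second.cpus;
      }
    }
    return totalCpus_ - used;
  }

  TaskState stateOf(const std::string& taskId) const {
    return tasks_.at(taskId).state;
  }

private:
  struct Task {
    std::string frameworkId;
    double cpus;
    TaskState state;
  };

  double totalCpus_;
  std::map<std::string, Task> tasks_;
};
```

In this sketch a stuck executor shutdown leaves the tasks in `PENDING_SHUTDOWN`
indefinitely, so the master keeps treating their resources as in use instead of
offering them again. The trade-off is that those resources stay unavailable
until the agent recovers or is removed.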
--
This message was sent by Atlassian Jira
(v8.3.4#803005)