[ 
https://issues.apache.org/jira/browse/MESOS-9940?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Bannier reassigned MESOS-9940:
---------------------------------------

    Assignee:     (was: Benjamin Bannier)

> Framework removal may lead to inconsistent task states between master and 
> agent.
> --------------------------------------------------------------------------------
>
>                 Key: MESOS-9940
>                 URL: https://issues.apache.org/jira/browse/MESOS-9940
>             Project: Mesos
>          Issue Type: Bug
>          Components: master
>            Reporter: Meng Zhu
>            Priority: Major
>              Labels: foundations
>
> When a framework is removed from the master (say, due to disconnection), the 
> master sends a `ShutdownFrameworkMessage` to the agent. At the same time, the 
> master transitions the task status to e.g. KILLED. 
> (https://github.com/apache/mesos/blob/master/src/master/master.cpp#L11247-L11291)
> When the agent receives the shutdown message, it tries to shut down all the 
> executors and destroy all the containers. The tasks' status is updated only 
> after all of this is done. 
> (https://github.com/apache/mesos/blob/master/src/slave/slave.cpp#L7914-L7922)
> However, if the executor shutdown gets stuck (e.g. due to a hanging Docker 
> daemon), the task status transition never happens on the agent, and the 
> master and agent end up with divergent views of these tasks.
> One consequence is that the master may try to schedule more workloads onto 
> the problematic agent (because it thinks those task resources have been 
> freed). Since we have no overcommit check on the agent, the agent will comply 
> and launch those tasks. This leads to over-allocation.
> One possible solution is to hold off on the master's status update until the 
> agent is done with the framework shutdown.
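
The divergence described above can be illustrated with a toy model. This is only a sketch, not Mesos code: the `Agent` and `Master` classes, their methods, the `wait_for_agent` flag, and the CPU numbers are all hypothetical and exist purely to show how freeing resources on the master before the agent finishes shutdown can lead to over-allocation, and how deferring the master's update avoids it.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class Agent:
    """Toy agent (hypothetical): tracks the CPUs it still considers in use."""
    total_cpus: float
    used_cpus: float = 0.0
    executor_hung: bool = False  # stands in for e.g. a hanging Docker daemon

    def launch(self, cpus: float) -> None:
        # No overcommit check, mirroring the behaviour described in the report.
        self.used_cpus += cpus

    def shutdown_framework(self, cpus: float) -> bool:
        # Resources are released only once executor shutdown completes.
        if self.executor_hung:
            return False
        self.used_cpus -= cpus
        return True


@dataclass
class Master:
    """Toy master (hypothetical): tracks what it believes is allocated."""
    agent: Agent
    allocated_cpus: float = 0.0

    def launch(self, cpus: float) -> None:
        self.allocated_cpus += cpus
        self.agent.launch(cpus)

    def remove_framework(self, cpus: float, wait_for_agent: bool) -> None:
        done = self.agent.shutdown_framework(cpus)
        # wait_for_agent=False: free resources immediately (reported behaviour).
        # wait_for_agent=True: free them only after the agent is done
        # (the possible solution suggested in the report).
        if done or not wait_for_agent:
            self.allocated_cpus -= cpus

    def available(self) -> float:
        return self.agent.total_cpus - self.allocated_cpus


def simulate(wait_for_agent: bool) -> Tuple[float, float]:
    """Return (CPUs the master thinks are allocated, CPUs the agent has in use)."""
    agent = Agent(total_cpus=4.0, executor_hung=True)
    master = Master(agent)
    master.launch(4.0)
    master.remove_framework(4.0, wait_for_agent)
    # The master schedules new work only if it believes resources are free.
    if master.available() >= 4.0:
        master.launch(4.0)
    return master.allocated_cpus, agent.used_cpus
```

With `simulate(False)` the agent ends up with 8 CPUs in use on a 4-CPU machine while the master believes only 4 are allocated. With `simulate(True)` the master never offers the stuck resources again, so both sides agree on 4.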



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
