[ 
https://issues.apache.org/jira/browse/MESOS-2367?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jie Yu updated MESOS-2367:
--------------------------
    Sprint: Twitter Mesos Q1 Sprint 5, Twitter Q2 Sprint 1 - 4/13  (was: Twitter Mesos Q1 Sprint 5)

> Improve slave resiliency in the face of orphan containers 
> ----------------------------------------------------------
>
>                 Key: MESOS-2367
>                 URL: https://issues.apache.org/jira/browse/MESOS-2367
>             Project: Mesos
>          Issue Type: Bug
>          Components: slave
>            Reporter: Joe Smith
>            Assignee: Jie Yu
>            Priority: Critical
>
> Right now there's a case where a misbehaving executor can cause a slave 
> process to flap:
> {panel:title=Quote From [~jieyu]}
> {quote}
> 1) User tries to kill an instance
> 2) Slave sends {{KillTaskMessage}} to executor
> 3) Executor sends kill signals to task processes
> 4) Executor sends {{TASK_KILLED}} to slave
> 5) Slave updates the container's CPU limit to 0.01 CPUs
> 6) A user process is still processing the kill signal
> 7) The task process cannot exit because it has too little CPU share and is 
> throttled
> 8) Executor itself terminates
> 9) Slave tries to destroy the container, but cannot because the user-process 
> is stuck in the exit path.
> 10) Slave restarts, and is constantly flapping because it cannot kill orphan 
> containers
> {quote}
> {panel}
> The slave's orphan container handling should be improved to deal with this 
> case despite ill-behaved users (framework writers).
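
A minimal sketch of one way the recovery path could tolerate such a container, assuming bounded retries are acceptable: try to destroy each orphan a fixed number of times, record the ones that stay stuck, and keep running instead of exiting. All names here (`cleanup_orphans`, `RecoveryReport`, the `destroy` callback) are hypothetical and are not Mesos APIs; this is an illustration of the idea, not the fix adopted for this issue.

```python
# Hypothetical sketch: none of these names come from Mesos.
from dataclasses import dataclass, field
from typing import Callable, Iterable, List


@dataclass
class RecoveryReport:
    """Outcome of one orphan cleanup pass."""
    destroyed: List[str] = field(default_factory=list)
    stuck: List[str] = field(default_factory=list)


def cleanup_orphans(
    orphans: Iterable[str],
    destroy: Callable[[str], bool],
    max_attempts: int = 3,
) -> RecoveryReport:
    """Try to destroy each orphan container without ever raising on failure.

    `destroy` is an assumed callback that returns True once the container
    is gone and False if it could not be destroyed on this attempt.
    """
    report = RecoveryReport()
    for container_id in orphans:
        for _ in range(max_attempts):
            if destroy(container_id):
                report.destroyed.append(container_id)
                break
        else:
            # All attempts failed: record the container and move on,
            # rather than aborting recovery and restarting the process.
            report.stuck.append(container_id)
    return report
```

A caller could log or retry the entries in `report.stuck` later, so a single unkillable container does not make the whole process flap.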



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
