[
https://issues.apache.org/jira/browse/MESOS-2367?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Jie Yu reassigned MESOS-2367:
-----------------------------
Assignee: Jie Yu
> Improve slave resiliency in the face of orphan containers
> ----------------------------------------------------------
>
> Key: MESOS-2367
> URL: https://issues.apache.org/jira/browse/MESOS-2367
> Project: Mesos
> Issue Type: Bug
> Components: slave
> Reporter: Joe Smith
> Assignee: Jie Yu
> Priority: Critical
>
> Right now there's a case where a misbehaving executor can cause a slave
> process to flap:
> {panel:title=Quote From [~jieyu]}
> {quote}
> 1) User tries to kill an instance
> 2) Slave sends {{KillTaskMessage}} to executor
> 3) Executor sends kill signals to task processes
> 4) Executor sends {{TASK_KILLED}} to slave
> 5) Slave updates container cpu limit to be 0.01 cpus
> 6) A user-process is still processing the kill signal
> 7) The task process cannot exit because it has too little CPU share and is
> throttled
> 8) Executor itself terminates
> 9) Slave tries to destroy the container, but cannot because the user-process
> is stuck in the exit path.
> 10) Slave restarts, and is constantly flapping because it cannot kill orphan
> containers
> {quote}
> {panel}
> The slave's orphan container handling should be improved to deal with this
> case despite ill-behaved users (framework writers).
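
To illustrate the kind of handling the description asks for, here is a minimal sketch of orphan cleanup that tolerates a container it cannot destroy instead of failing recovery. It is not Mesos code: `OrphanCleanup`, `recover`, `fake_destroy`, and `max_attempts` are hypothetical names, and the destroy call is simulated. The idea is only that a failed destroy is recorded and retried a bounded number of times, after which the container is reported as stuck rather than causing the slave to restart.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class OrphanCleanup:
    """Hypothetical helper: tracks orphans that could not be destroyed."""
    destroy: Callable[[str], bool]  # returns True if the container is gone
    max_attempts: int = 3           # assumed retry budget, not a Mesos setting
    attempts: Dict[str, int] = field(default_factory=dict)
    stuck: List[str] = field(default_factory=list)

    def recover(self, orphans: List[str]) -> List[str]:
        """Try each orphan once; return the ones that still need a retry."""
        pending = []
        for container_id in orphans:
            if self.destroy(container_id):
                continue  # destroyed cleanly, nothing more to do
            count = self.attempts.get(container_id, 0) + 1
            self.attempts[container_id] = count
            if count >= self.max_attempts:
                # Give up and report it instead of aborting recovery.
                self.stuck.append(container_id)
            else:
                pending.append(container_id)
        return pending


def fake_destroy(container_id: str) -> bool:
    """Simulated destroy: one container never exits (as in step 9)."""
    return container_id != "stuck-container"


cleanup = OrphanCleanup(destroy=fake_destroy)
pending = ["healthy-container", "stuck-container"]
while pending:
    pending = cleanup.recover(pending)
```

In this sketch the loop ends with `stuck-container` listed in `cleanup.stuck`, so the caller can keep running and surface the problem to an operator. A real implementation would presumably space the retries out over time rather than looping immediately.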