Joe Smith created MESOS-2367:
--------------------------------
Summary: Improve slave resiliency in the face of orphan containers
Key: MESOS-2367
URL: https://issues.apache.org/jira/browse/MESOS-2367
Project: Mesos
Issue Type: Bug
Components: slave
Reporter: Joe Smith
Currently, a misbehaving executor can cause the slave process to flap (repeatedly restart), as in this sequence:
{panel:title=Quote From [~jieyu]}
{quote}
1) User tries to kill an instance
2) Slave sends {{KillTaskMessage}} to executor
3) Executor sends kill signals to task processes
4) Executor sends {{TASK_KILLED}} to slave
5) Slave updates the container CPU limit to 0.01 CPUs
6) A user process is still handling the kill signal
7) The task process cannot exit because it has too little CPU share and is throttled
8) Executor itself terminates
9) Slave tries to destroy the container, but cannot, because the user process is stuck in the exit path
10) Slave restarts, and keeps flapping because it cannot kill the orphan containers
{quote}
{panel}
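A rough estimate helps explain why step 7 can stall for so long. This is a sketch under assumptions the report does not state: that the 0.01 CPU limit is enforced with Linux CFS bandwidth control (quota = cpus × period), that the period is the kernel default of 100 ms, and that the exit path needs a hypothetical 500 ms of CPU time. `cfs_quota_ms` and `wall_clock_seconds` are illustrative helpers, not Mesos code.

```python
# Rough estimate of how long a CPU-throttled process needs to finish work.
# Assumes the limit is enforced with Linux CFS bandwidth control
# (quota = cpus * period) and the kernel's default 100 ms period; neither
# assumption comes from the report above.

CFS_PERIOD_MS = 100.0  # assumed cpu.cfs_period_us of 100000


def cfs_quota_ms(cpus: float, period_ms: float = CFS_PERIOD_MS) -> float:
    """CPU time (ms) the container may consume in each CFS period."""
    return cpus * period_ms


def wall_clock_seconds(cpu_work_ms: float, cpus: float,
                       period_ms: float = CFS_PERIOD_MS) -> float:
    """Approximate wall-clock time to accumulate `cpu_work_ms` of CPU time
    when each period grants at most `cfs_quota_ms(cpus)` ms of it."""
    periods = cpu_work_ms / cfs_quota_ms(cpus, period_ms)
    return periods * period_ms / 1000.0


# Step 5 sets the limit to 0.01 CPUs; 500 ms of exit-path work is hypothetical.
print(f"{wall_clock_seconds(500, 0.01):.1f} s")  # prints "50.0 s"
```

Under these assumptions the container receives only 1 ms of CPU per 100 ms period, so work that would finish in 0.5 s on a full CPU takes roughly 50 s of wall-clock time.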
The slave's handling of orphan containers should be improved so that it survives this case even when users (framework writers) are ill-behaved.
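One possible shape for such an improvement, offered purely as an illustration: during recovery, make a bounded number of destroy attempts per orphan and record the ones that remain, instead of treating them as fatal. `recover_orphans`, `RecoveryResult`, `destroy`, and `max_attempts` are hypothetical names, not Mesos APIs. The real destroy path is asynchronous, so modelling it as a synchronous boolean call is a simplification.

```python
# Hypothetical sketch of non-fatal orphan cleanup during slave recovery.
# None of these names exist in Mesos; destroy() stands in for the real,
# asynchronous container destroy path.
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class RecoveryResult:
    destroyed: List[str] = field(default_factory=list)
    pending: List[str] = field(default_factory=list)  # still alive; retry later


def recover_orphans(orphans: List[str],
                    destroy: Callable[[str], bool],
                    max_attempts: int = 3) -> RecoveryResult:
    """Try to destroy each orphan, but never let one failure abort recovery.

    `destroy` returns True once the container is gone. An orphan that is
    still alive after `max_attempts` tries is recorded as pending instead of
    raising, so recovery can complete rather than exiting and restarting.
    """
    result = RecoveryResult()
    for container_id in orphans:
        for _ in range(max_attempts):
            if destroy(container_id):
                result.destroyed.append(container_id)
                break
        else:  # loop finished without a successful destroy
            result.pending.append(container_id)
    return result
```

For example, `recover_orphans(["c1", "c2"], lambda c: c != "c2")` reports `c1` as destroyed and `c2` as pending. The design choice is that an unkillable orphan becomes a tracked leftover to retry or report later, rather than a failure that triggers another restart.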
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)