[ https://issues.apache.org/jira/browse/MESOS-2367?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Jie Yu updated MESOS-2367:
--------------------------
    Sprint: Twitter Mesos Q1 Sprint 5, Twitter Q2 Sprint 1 - 4/13  (was: Twitter Mesos Q1 Sprint 5)

> Improve slave resiliency in the face of orphan containers
> ----------------------------------------------------------
>
>                 Key: MESOS-2367
>                 URL: https://issues.apache.org/jira/browse/MESOS-2367
>             Project: Mesos
>          Issue Type: Bug
>          Components: slave
>            Reporter: Joe Smith
>            Assignee: Jie Yu
>            Priority: Critical
>
> Right now there's a case where a misbehaving executor can cause a slave process to flap:
> {panel:title=Quote From [~jieyu]}
> {quote}
> 1) User tries to kill an instance
> 2) Slave sends {{KillTaskMessage}} to executor
> 3) Executor sends kill signals to task processes
> 4) Executor sends {{TASK_KILLED}} to slave
> 5) Slave updates container cpu limit to be 0.01 cpus
> 6) A user-process is still processing the kill signal
> 7) the task process cannot exit since it has too little cpu share and is throttled
> 8) Executor itself terminates
> 9) Slave tries to destroy the container, but cannot because the user-process is stuck in the exit path.
> 10) Slave restarts, and is constantly flapping because it cannot kill orphan containers
> {quote}
> {panel}
> The slave's orphan container handling should be improved to deal with this case despite ill-behaved users (framework writers).

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
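
Step 7 in the quoted sequence can be illustrated with a rough back-of-the-envelope model. This is only a sketch. It assumes the 0.01 cpus limit behaves as a hard cap, so a process receives at most 0.01 CPU-seconds per wall-clock second. The function name and the 0.5 CPU-second figure for exit handling are hypothetical and do not come from the issue.

```python
def wall_seconds_to_exit(cpu_seconds_needed: float, cpu_limit: float) -> float:
    """Estimate the wall-clock time a throttled process needs to finish exiting.

    cpu_seconds_needed: CPU time the exit path still has to consume (assumed).
    cpu_limit: container CPU limit in "cpus", treated here as a hard cap.
    """
    if cpu_limit <= 0:
        raise ValueError("cpu_limit must be positive")
    # Under a hard cap the process gets cpu_limit CPU-seconds per wall second.
    return cpu_seconds_needed / cpu_limit


if __name__ == "__main__":
    needed = 0.5  # hypothetical CPU-seconds of signal handling and cleanup
    for limit in (1.0, 0.1, 0.01):
        estimate = wall_seconds_to_exit(needed, limit)
        print(f"{limit} cpus -> about {estimate:.1f} s of wall-clock time")
```

Under these assumptions, lowering the limit by a factor of 100 stretches the exit path by the same factor. That is consistent with the container destroy in step 9 appearing to hang while the process is still slowly making progress.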