[ https://issues.apache.org/jira/browse/MESOS-8411?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16361885#comment-16361885 ]
Benjamin Mahler commented on MESOS-8411:
----------------------------------------
{noformat}
commit bc6b61bca37752689cffa40a14c53ad89f24e8fc
Author: Meng Zhu <[email protected]>
Date: Mon Feb 12 22:29:53 2018 -0800

Added test to verify task-less executor is shutdown when re-subscribing.

This test verifies that the v1 executor is shutdown if all of its
initial tasks could not be delivered when re-subscribing with
a recovered agent. See MESOS-8411.

Review: https://reviews.apache.org/r/65497/
{noformat}
> Killing a queued task can lead to the command executor never terminating.
> -------------------------------------------------------------------------
>
> Key: MESOS-8411
> URL: https://issues.apache.org/jira/browse/MESOS-8411
> Project: Mesos
> Issue Type: Bug
> Components: agent
> Affects Versions: 1.3.1, 1.4.1, 1.5.0
> Reporter: Benjamin Mahler
> Assignee: Meng Zhu
> Priority: Critical
> Fix For: 1.4.2, 1.6.0, 1.5.1, 1.3.3
>
>
> If a task is killed while the executor is re-registering, we will remove it
> from queued tasks and shut down the executor if all of its initial tasks
> could not be delivered. However, there is a case (within {{Slave::___run}})
> where we leave the executor running. The race is:
> # Command-executor task launched.
> # Command executor sends registration message. Agent tells containerizer to
> update the resources before it sends the tasks to the executor.
> # Kill arrives, and we synchronously remove the task from queued tasks.
> # Containerizer finishes updating the resources, and in {{Slave::___run}} the
> killed task is ignored.
> # Command executor stays running!
> Executors could have a timeout to handle this case, but it's not clear that
> all executors will implement this correctly. It would be better to have a
> defensive policy that will shut down an executor if all of its initial batch
> of tasks were killed prior to delivery.
> In order to implement this, one approach discussed with [~vinodkone] is to
> look at the running + terminated but unacked + completed tasks, and if empty,
> shut the executor down in the {{Slave::___run}} path. This will require us to
> check that the completed task cache size is set to at least 1, and this also
> assumes that the completed tasks are not cleared based on time or during
> agent recovery.
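
The race and the proposed defensive check can be illustrated with a small model. This is only a sketch under stated assumptions, not the agent's actual C++ code: the `Executor` class, its field names, and the functions `kill_queued_task`, `on_resources_updated`, and `replay_race` are all hypothetical stand-ins for the agent's bookkeeping. The extra check that no other task is still queued is also an assumption added here, not something stated in the description.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Executor:
    """Hypothetical stand-in for the agent's per-executor bookkeeping."""
    queued: List[str] = field(default_factory=list)
    running: List[str] = field(default_factory=list)
    terminated_unacked: List[str] = field(default_factory=list)
    completed: List[str] = field(default_factory=list)
    shut_down: bool = False


def kill_queued_task(executor: Executor, task_id: str) -> None:
    """Step 3 of the race: the kill synchronously drops the queued task."""
    if task_id in executor.queued:
        executor.queued.remove(task_id)


def has_task_history(executor: Executor) -> bool:
    """True if the executor has any running, terminated-but-unacked,
    or completed task, i.e. at least one task was delivered to it."""
    return bool(
        executor.running or executor.terminated_unacked or executor.completed
    )


def on_resources_updated(executor: Executor, task_id: str) -> None:
    """Models step 4: the resource update finished and the task is delivered."""
    if task_id in executor.queued:
        # Normal path: the task is still queued, so hand it to the executor.
        executor.queued.remove(task_id)
        executor.running.append(task_id)
        return

    # The task was killed while queued, so it is ignored. Defensive policy:
    # if the executor never received a task, shut it down instead of leaving
    # it running. The `queued` check is an assumption of this sketch.
    if not executor.queued and not has_task_history(executor):
        executor.shut_down = True


def replay_race() -> Executor:
    """Replays steps 1 to 4 of the race for a single-task executor."""
    executor = Executor(queued=["task-1"])    # steps 1-2: launch, registration
    kill_queued_task(executor, "task-1")      # step 3: kill arrives
    on_resources_updated(executor, "task-1")  # step 4: update completes
    return executor
```

In this model, `replay_race()` ends with the executor marked as shut down rather than left running. The check relies on `completed` retaining at least one entry: if completed tasks were dropped (a cache size of 0, time-based clearing, or loss during agent recovery), an executor that had already finished real work would look task-less, which is the caveat raised in the description.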