[
https://issues.apache.org/jira/browse/MESOS-7744?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16117569#comment-16117569
]
Benjamin Mahler commented on MESOS-7744:
----------------------------------------
Thanks for reporting this and including the logs, [~sargun]. I took a look; the
bug is due to a race introduced when Slave::statusUpdate was made asynchronous:
(1) Slave::__run completes; the task is now within Executor::queuedTasks.
(2) Slave::killTask locates the executor based on the TaskID residing in
Executor::queuedTasks and calls statusUpdate() with TASK_KILLED.
(3) Slave::___run assumes that killed tasks have been removed from
Executor::queuedTasks, but that removal now occurs asynchronously in
Slave::_statusUpdate. So the executor still sees the queued task; the agent
delivers it and adds it to Executor::launchedTasks.
(4) Slave::_statusUpdate runs, removes the task from Executor::launchedTasks,
and adds it to Executor::terminatedTasks.
I filed MESOS-7865 to capture the bug. Unfortunately, the fix will only be
cherry-picked back as far as 1.1.x.
> Mesos Agent Sends TASK_KILL status update to Master, and still launches task
> ----------------------------------------------------------------------------
>
> Key: MESOS-7744
> URL: https://issues.apache.org/jira/browse/MESOS-7744
> Project: Mesos
> Issue Type: Bug
> Affects Versions: 1.0.1
> Reporter: Sargun Dhillon
> Priority: Minor
> Labels: reliability
>
> We sometimes launch jobs, and cancel them in ~7 seconds, if we don't get a
> TASK_STARTING back from the agent. Under certain conditions it can result in
> Mesos losing track of the task. The chunk of the logs which is interesting is
> here:
> {code}
> Jun 29 23:22:26 titusagent-mainvpc-r3.8xlarge.2-i-04907efc9f1f8535c
> mesos-slave[4290]: I0629 23:22:26.951799 5171 slave.cpp:1495] Got assigned
> task Titus-7590548-worker-0-4476 for framework TitusFramework
> Jun 29 23:22:26 titusagent-mainvpc-r3.8xlarge.2-i-04907efc9f1f8535c
> mesos-slave[4290]: I0629 23:22:26.952251 5171 slave.cpp:1614] Launching task
> Titus-7590548-worker-0-4476 for framework TitusFramework
> Jun 29 23:22:37 titusagent-mainvpc-r3.8xlarge.2-i-04907efc9f1f8535c
> mesos-slave[4290]: I0629 23:22:37.484611 5171 slave.cpp:1853] Queuing task
> 'Titus-7590548-worker-0-4476' for executor 'docker-executor' of framework
> TitusFramework at executor(1)@100.66.11.10:17707
> Jun 29 23:22:37 titusagent-mainvpc-r3.8xlarge.2-i-04907efc9f1f8535c
> mesos-slave[4290]: I0629 23:22:37.487876 5171 slave.cpp:2035] Asked to kill
> task Titus-7590548-worker-0-4476 of framework TitusFramework
> Jun 29 23:22:37 titusagent-mainvpc-r3.8xlarge.2-i-04907efc9f1f8535c
> mesos-slave[4290]: I0629 23:22:37.488994 5171 slave.cpp:3211] Handling
> status update TASK_KILLED (UUID: 898215d6-a244-4dbe-bc9c-878a22d36ea4) for
> task Titus-7590548-worker-0-4476 of framework TitusFramework from @0.0.0.0:0
> Jun 29 23:22:37 titusagent-mainvpc-r3.8xlarge.2-i-04907efc9f1f8535c
> mesos-slave[4290]: I0629 23:22:37.490603 5171 slave.cpp:2005] Sending queued
> task 'Titus-7590548-worker-0-4476' to executor 'docker-executor' of framework
> TitusFramework at executor(1)@100.66.11.10:17707
> {code}
> In our executor, we see that the launch message arrives after the master has
> already gotten the kill update. We then send non-terminal state updates to
> the agent, and yet it doesn't forward these to our framework. We're using a
> custom executor which is based on the older mesos-go bindings.
--
This message was sent by Atlassian JIRA
(v6.4.14#64029)