[
https://issues.apache.org/jira/browse/MESOS-7744?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16117569#comment-16117569
]
Benjamin Mahler commented on MESOS-7744:
----------------------------------------
Thanks for reporting this and including the logs, [~sargun]. I took a look; the
bug is due to a race introduced when Slave::statusUpdate was made asynchronous:
(1) Slave::__run completes; the task is now within Executor::queuedTasks.
(2) Slave::killTask locates the executor based on the TaskID residing in
Executor::queuedTasks and calls statusUpdate() with TASK_KILLED.
(3) Slave::___run assumes that killed tasks have been removed from
Executor::queuedTasks, but that removal now occurs asynchronously in
Slave::_statusUpdate. So the executor still sees the queued task; the agent
delivers it and adds it to Executor::launchedTasks.
(4) Slave::_statusUpdate runs, removes the task from Executor::launchedTasks,
and adds it to Executor::terminatedTasks.
I filed MESOS-7865 to capture the bug. Unfortunately, the fix will only be
cherry-picked back as far as 1.1.x.
> Mesos Agent Sends TASK_KILL status update to Master, and still launches task
> ----------------------------------------------------------------------------
>
> Key: MESOS-7744
> URL: https://issues.apache.org/jira/browse/MESOS-7744
> Project: Mesos
> Issue Type: Bug
> Affects Versions: 1.0.1
> Reporter: Sargun Dhillon
> Priority: Minor
> Labels: reliability
>
> We sometimes launch jobs, and cancel them in ~7 seconds, if we don't get a
> TASK_STARTING back from the agent. Under certain conditions it can result in
> Mesos losing track of the task. The chunk of the logs which is interesting is
> here:
> {code}
> Jun 29 23:22:26 titusagent-mainvpc-r3.8xlarge.2-i-04907efc9f1f8535c
> mesos-slave[4290]: I0629 23:22:26.951799 5171 slave.cpp:1495] Got assigned
> task Titus-7590548-worker-0-4476 for framework TitusFramework
> Jun 29 23:22:26 titusagent-mainvpc-r3.8xlarge.2-i-04907efc9f1f8535c
> mesos-slave[4290]: I0629 23:22:26.952251 5171 slave.cpp:1614] Launching task
> Titus-7590548-worker-0-4476 for framework TitusFramework
> Jun 29 23:22:37 titusagent-mainvpc-r3.8xlarge.2-i-04907efc9f1f8535c
> mesos-slave[4290]: I0629 23:22:37.484611 5171 slave.cpp:1853] Queuing task
> 'Titus-7590548-worker-0-4476' for executor 'docker-executor' of framework
> TitusFramework at executor(1)@100.66.11.10:17707
> Jun 29 23:22:37 titusagent-mainvpc-r3.8xlarge.2-i-04907efc9f1f8535c
> mesos-slave[4290]: I0629 23:22:37.487876 5171 slave.cpp:2035] Asked to kill
> task Titus-7590548-worker-0-4476 of framework TitusFramework
> Jun 29 23:22:37 titusagent-mainvpc-r3.8xlarge.2-i-04907efc9f1f8535c
> mesos-slave[4290]: I0629 23:22:37.488994 5171 slave.cpp:3211] Handling
> status update TASK_KILLED (UUID: 898215d6-a244-4dbe-bc9c-878a22d36ea4) for
> task Titus-7590548-worker-0-4476 of framework TitusFramework from @0.0.0.0:0
> Jun 29 23:22:37 titusagent-mainvpc-r3.8xlarge.2-i-04907efc9f1f8535c
> mesos-slave[4290]: I0629 23:22:37.490603 5171 slave.cpp:2005] Sending queued
> task 'Titus-7590548-worker-0-4476' to executor 'docker-executor' of framework
> TitusFramework at executor(1)@100.66.11.10:17707
> {code}
> In our executor, we see that the launch message arrives after the master has
> already gotten the kill update. We then send non-terminal state updates to
> the agent, and yet it doesn't forward these to our framework. We're using a
> custom executor which is based on the older mesos-go bindings.
--
This message was sent by Atlassian JIRA
(v6.4.14#64029)