-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/61639/#review183266
-----------------------------------------------------------


Fix it, then Ship it!





src/slave/slave.cpp
Lines 4502 (patched)
<https://reviews.apache.org/r/61639/#comment259282>

    Can you expand a bit more in the comment on why we need to synchronously 
transition the task, for posterity?


- Vinod Kone


On Aug. 15, 2017, 7 a.m., Benjamin Mahler wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/61639/
> -----------------------------------------------------------
> 
> (Updated Aug. 15, 2017, 7 a.m.)
> 
> 
> Review request for mesos, Anand Mazumdar and Vinod Kone.
> 
> 
> Bugs: MESOS-7744 and MESOS-7865
>     https://issues.apache.org/jira/browse/MESOS-7744
>     https://issues.apache.org/jira/browse/MESOS-7865
> 
> 
> Repository: mesos
> 
> 
> Description
> -------
> 
> The following race leads to the agent both killing and launching a task:
> 
>   (1) Slave::__run completes, task is now within Executor::queuedTasks.
>   (2) Slave::killTask locates the executor based on the task ID residing
>       in queuedTasks, calls Slave::statusUpdate() with TASK_KILLED.
>   (3) Slave::___run assumes that killed tasks have been removed from
>       Executor::queuedTasks, but this now occurs asynchronously in
>       Slave::_statusUpdate. So, the agent still sees the queued task
>       and delivers it and adds the task to Executor::launchedTasks.
>   (3) Slave::_statusUpdate runs, removes the task from
>       Executor::launchedTasks and adds it to Executor::terminatedTasks.
> 
> The fix applied here is to synchronously transition queued tasks to
> a terminal state when statusUpdate is called. This can be done because
> for queued tasks, we do not need to retrieve the container status (the
> task never reached the container).
> 
> 
> Diffs
> -----
> 
>   src/slave/slave.hpp 1fe93dab1b2bef24721cc1bcffebe1b259e96d79 
>   src/slave/slave.cpp 7381530515f86faf4c3e8f82bcd9483f6cf0498b 
> 
> 
> Diff: https://reviews.apache.org/r/61639/diff/1/
> 
> 
> Testing
> -------
> 
> make check
> 
> SlaveTest.KillQueuedTaskDuringExecutorRegistration captures this case, but it 
> did not delay retrieving the container status. This test could have been 
> updated previously to delay the container status, but now there is no 
> container status to delay, so I've left the test alone.
> 
> 
> Thanks,
> 
> Benjamin Mahler
> 
>

Reply via email to