-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/61639/
-----------------------------------------------------------
Review request for mesos, Anand Mazumdar and Vinod Kone.
Bugs: MESOS-7744 and MESOS-7865
https://issues.apache.org/jira/browse/MESOS-7744
https://issues.apache.org/jira/browse/MESOS-7865
Repository: mesos
Description
-------
The following race leads to the agent both killing and launching a task:
(1) Slave::__run completes, task is now within Executor::queuedTasks.
(2) Slave::killTask locates the executor based on the task ID residing
in queuedTasks, calls Slave::statusUpdate() with TASK_KILLED.
(3) Slave::___run assumes that killed tasks have been removed from
Executor::queuedTasks, but this now occurs asynchronously in
Slave::_statusUpdate. So, the agent still sees the queued task
and delivers it and adds the task to Executor::launchedTasks.
(3) Slave::_statusUpdate runs, removes the task from
Executor::launchedTasks and adds it to Executor::terminatedTasks.
The fix applied here is to synchronously transition queued tasks to
a terminal state when statusUpdate is called. This can be done because
for queued tasks, we do not need to retrieve the container status (the
task never reached the container).
Diffs
-----
src/slave/slave.hpp 1fe93dab1b2bef24721cc1bcffebe1b259e96d79
src/slave/slave.cpp 7381530515f86faf4c3e8f82bcd9483f6cf0498b
Diff: https://reviews.apache.org/r/61639/diff/1/
Testing
-------
make check
SlaveTest.KillQueuedTaskDuringExecutorRegistration captures this case, but it
did not delay retrieving the container status. This test could have been
updated previously to delay the container status, but now there is no container
status to delay, so I've left the test alone.
Thanks,
Benjamin Mahler