----------------------------------------------------------- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/61639/#review183266 -----------------------------------------------------------
Fix it, then Ship it! src/slave/slave.cpp Lines 4502 (patched) <https://reviews.apache.org/r/61639/#comment259282> Can you expand a bit more in the comment on why we need to synchronously transition the task, for posterity? - Vinod Kone On Aug. 15, 2017, 7 a.m., Benjamin Mahler wrote: > > ----------------------------------------------------------- > This is an automatically generated e-mail. To reply, visit: > https://reviews.apache.org/r/61639/ > ----------------------------------------------------------- > > (Updated Aug. 15, 2017, 7 a.m.) > > > Review request for mesos, Anand Mazumdar and Vinod Kone. > > > Bugs: MESOS-7744 and MESOS-7865 > https://issues.apache.org/jira/browse/MESOS-7744 > https://issues.apache.org/jira/browse/MESOS-7865 > > > Repository: mesos > > > Description > ------- > > The following race leads to the agent both killing and launching a task: > > (1) Slave::__run completes, task is now within Executor::queuedTasks. > (2) Slave::killTask locates the executor based on the task ID residing > in queuedTasks, calls Slave::statusUpdate() with TASK_KILLED. > (3) Slave::___run assumes that killed tasks have been removed from > Executor::queuedTasks, but this now occurs asynchronously in > Slave::_statusUpdate. So, the agent still sees the queued task > and delivers it and adds the task to Executor::launchedTasks. > (3) Slave::_statusUpdate runs, removes the task from > Executor::launchedTasks and adds it to Executor::terminatedTasks. > > The fix applied here is to synchronously transition queued tasks to > a terminal state when statusUpdate is called. This can be done because > for queued tasks, we do not need to retrieve the container status (the > task never reached the container). > > > Diffs > ----- > > src/slave/slave.hpp 1fe93dab1b2bef24721cc1bcffebe1b259e96d79 > src/slave/slave.cpp 7381530515f86faf4c3e8f82bcd9483f6cf0498b > > > Diff: https://reviews.apache.org/r/61639/diff/1/ > > > Testing > ------- > > make check > > SlaveTest.KillQueuedTaskDuringExecutorRegistration captures this case, but it > did not delay retrieving the container status. This test could have been > updated previously to delay the container status, but now there is no > container status to delay, so I've left the test alone. > > > Thanks, > > Benjamin Mahler > >
