Re: Review Request 23912: Fix MESOS-947: Slave should properly handle a killTask() that arrives between runTask() and _runTask()

Bernd Mathiske Mon, 13 Oct 2014 03:52:13 -0700


> On Oct. 10, 2014, 11:24 a.m., Vinod Kone wrote:
> > src/slave/slave.cpp, line 1201
> > <https://reviews.apache.org/r/23912/diff/7-8/?file=715213#file715213line1201>
> >
> >     why the change here?
> >     
> >     is it possible for pending[executorId] to be empty() if you didn't 
> > erase the task just above? i would revert this.


There are two cases.
1. "pending" means that the task is about to be started and has NOT been 
killed. In this case, if this is the last task that is about to start with the 
same executor, we remove the executorId from the pending list.
2. "not pending" means that the task has been killed meanwhile. If in this case 
there are no other tasks pending for this executor, we ALSO remove the 
executorId. 

If we move the executorId removal and the empty() check inside the "pending" 
case only, we have a bug. Then the framework does not get removed either 
further down the line.


> On Oct. 10, 2014, 11:24 a.m., Vinod Kone wrote:
> > src/tests/slave_tests.cpp, line 1087
> > <https://reviews.apache.org/r/23912/diff/7-8/?file=715216#file715216line1087>
> >
> >     How can removeFramework() be called twice!!!???? Wouldn't that throw a 
> > CHECK failure?
> 
> Bernd Mathiske wrote:
>     I thought that shutting down would call it again, but no. The debugger 
> says gmock reacts to the Master::removeFramework call, not 
> Slave::removeFramework, here. Will investigate.

I found out what the problem is: embarassing gmock pilot error. I should have 
called the original removeFramework, then there is no second coming of it in 
Shutdown(). Fixed now.


- Bernd


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/23912/#review56016
-----------------------------------------------------------


On Oct. 9, 2014, 7:10 a.m., Bernd Mathiske wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/23912/
> -----------------------------------------------------------
> 
> (Updated Oct. 9, 2014, 7:10 a.m.)
> 
> 
> Review request for mesos.
> 
> 
> Bugs: MESOS-947
>     https://issues.apache.org/jira/browse/MESOS-947
> 
> 
> Repository: mesos-git
> 
> 
> Description
> -------
> 
> Fixes MESOS-947 "Slave should properly handle a killTask() that arrives 
> between runTask() and _runTask()".
> 
> Slave::killTask() did not check for task in question combination to be 
> "pending" (i.e. Slave::runTask had happened, but Slave::_runTask had not yet) 
> and then erroneously assumed that Slave::runTask() had not been executed. The 
> task was then marked "LOST" instead of "KILLED". But Slave::runTask had 
> already scheduled Slave::_runTask to follow. Now the entry for being 
> "pending" is removed, and the task is marked "KILLED", and _runTask gets 
> informed about this. It checks whether the task in question is currently 
> "pending" and if it is not, then it infers that the task has been killed and 
> does not erroneously try to complete launching it.
> 
> 
> Diffs
> -----
> 
>   src/slave/slave.hpp 76d505c698774204b2536b66ea8a83a9a2a5e2c1 
>   src/slave/slave.cpp cb3759993f863590cb1545c73072feb0331aa6c9 
>   src/tests/mesos.hpp 957e2233cc11c438fd80d3b6d1907a1223093104 
>   src/tests/mesos.cpp 3dcb2acd5ad4ab5e3a7b4fe524ee077558112773 
>   src/tests/slave_tests.cpp 69be28f6e82b99e23424bd2be8294f715d8040d4 
> 
> Diff: https://reviews.apache.org/r/23912/diff/
> 
> 
> Testing
> -------
> 
> Wrote a unit test that reliably created the situation described in the 
> ticket. Observed that TASK_LOST and the listed log output occurred. This 
> pointed directly to the lines in killTask() where the problem is rooted. Ran 
> the test after fixing, it succeeded. Checked the log. It looks like a "clean 
> kill" now :-)
> 
> 
> Thanks,
> 
> Bernd Mathiske
> 
>

Re: Review Request 23912: Fix MESOS-947: Slave should properly handle a killTask() that arrives between runTask() and _runTask()

Reply via email to