Re: Review Request 23912: Fix MESOS-947: Slave should properly handle a killTask() that arrives between runTask() and _runTask()

Bernd Mathiske Wed, 15 Oct 2014 00:57:04 -0700


> On Oct. 10, 2014, 11:24 a.m., Vinod Kone wrote:
> > src/slave/slave.cpp, line 1413
> > <https://reviews.apache.org/r/23912/diff/8/?file=716689#file716689line1413>
> >
> >     It is weird to me that you remove the task here but (potentially) 
> > remove the executor up in _runTask(). It's not obvious to me why you made 
> > that choice. If there is a good reason, please add a comment here.
> 
> Bernd Mathiske wrote:
>     The task is removed in killTask(), because later on the information that 
> it should be removed is no longer easily at hand.
>     
>     I looked at lines 1194, 1195 in _runTask():
>       Framework* framework = getFramework(frameworkId);
>       CHECK_NOTNULL(framework);
>       
>     If I removed the framework earlier than this, this check would fire. If I 
> want to avoid that, I need to write extra code (e.g. prevent _runTask() from 
> happening). Assembling all framework removals in the second half of 
> _runTask() looks like good defensive practice to me.
>     
>     There is no harm in removing the framework in _runTask() vs. killTask() 
> since _runTask() must eventually happen. But by waiting with removing the 
> framework until _runTask(), we do not need to think about what implications 
> concurrent starting and killing of multiple tasks with the same executor 
> might have.
> 
> Vinod Kone wrote:
>     I don't think I follow what you mean by
>     
>     "The task is removed in killTask(), because later on the information that 
> it should be removed is no longer easily at hand"
>     
>     and 
>     
>     "...we do not need to think about what implications concurrent starting 
> and killing of multiple tasks with the same executor might have"
>     
>     
>     Anyway, here's why I think we should remove the framework in killTask().
>     
>     It doesn't require non-local reasoning. Right now, someone reading the 
> killTask() method should know that a pending task means a "_runTask()" 
> dispatch is in the queue and that the executor and framework will be removed 
> in '_runTask()'. Instead if killTask() can remove the framework right away, 
> it's easy to understand.
>     
>     It is much simple for methods that take "FrameworkID" (instead of 
> Framework*) to check if the framework exists. If the framework doesn't exist, 
> for whatever reason, it could just bail.
>     
>     I think these semantics would make future contributors' life easy when 
> making changes to these methods.
>     
>     Changing the CHECK in '_runTask()' to an if condition doesn't seem too 
> bad to me.
>     
>     ```
>     _runTask()
>     {
>        if (framework == NULL) {
>          LOG(WARNING) << "Ignoring run task " << task........
>                       << " because the framework does not exist";
>                       
>          return;
>        }
>     }
>     ```
>     
>     Makes sense?


This is more confusing for the user. He will then ask himself "why does the 
framework not exist"? And then have to track back in the log to find that the 
task has been killed. Making it simpler for users trumps making it simpler for 
us Mesos developers IMHO.


- Bernd


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/23912/#review56016
-----------------------------------------------------------


On Oct. 14, 2014, 2:54 p.m., Bernd Mathiske wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/23912/
> -----------------------------------------------------------
> 
> (Updated Oct. 14, 2014, 2:54 p.m.)
> 
> 
> Review request for mesos and Vinod Kone.
> 
> 
> Bugs: MESOS-947
>     https://issues.apache.org/jira/browse/MESOS-947
> 
> 
> Repository: mesos-git
> 
> 
> Description
> -------
> 
> Fixes MESOS-947 "Slave should properly handle a killTask() that arrives 
> between runTask() and _runTask()".
> 
> Slave::killTask() did not check for task in question combination to be 
> "pending" (i.e. Slave::runTask had happened, but Slave::_runTask had not yet) 
> and then erroneously assumed that Slave::runTask() had not been executed. The 
> task was then marked "LOST" instead of "KILLED". But Slave::runTask had 
> already scheduled Slave::_runTask to follow. Now the entry for being 
> "pending" is removed, and the task is marked "KILLED", and _runTask gets 
> informed about this. It checks whether the task in question is currently 
> "pending" and if it is not, then it infers that the task has been killed and 
> does not erroneously try to complete launching it.
> 
> 
> Diffs
> -----
> 
>   src/slave/slave.hpp 342b09fc084c20d98d096bb129830440179c092c 
>   src/slave/slave.cpp 0e342ed35e3db3b68f9f32b6cf4ace23e4a4db38 
>   src/tests/mesos.hpp 957e2233cc11c438fd80d3b6d1907a1223093104 
>   src/tests/mesos.cpp 3dcb2acd5ad4ab5e3a7b4fe524ee077558112773 
>   src/tests/slave_tests.cpp f585bdd20ae1af466f2c1b4d85331ac67451552f 
> 
> Diff: https://reviews.apache.org/r/23912/diff/
> 
> 
> Testing
> -------
> 
> Wrote a unit test that reliably created the situation described in the 
> ticket. Observed that TASK_LOST and the listed log output occurred. This 
> pointed directly to the lines in killTask() where the problem is rooted. Ran 
> the test after fixing, it succeeded. Checked the log. It looks like a "clean 
> kill" now :-)
> 
> 
> Thanks,
> 
> Bernd Mathiske
> 
>

Re: Review Request 23912: Fix MESOS-947: Slave should properly handle a killTask() that arrives between runTask() and _runTask()

Reply via email to