> On Jan. 22, 2016, 6:58 a.m., Ben Mahler wrote:
> > Logically looks good, just a couple of trivial comments and we can get this 
> > landed!

I was going to start fixing things up, but it looks like you took care of it 
for me.  Thanks!


- Kevin


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/42519/#review115796
-----------------------------------------------------------


On Jan. 19, 2016, 10:58 p.m., Kevin Klues wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/42519/
> -----------------------------------------------------------
> 
> (Updated Jan. 19, 2016, 10:58 p.m.)
> 
> 
> Review request for mesos, Ben Mahler and Greg Mann.
> 
> 
> Bugs: MESOS-4409
>     https://issues.apache.org/jira/browse/MESOS-4409
> 
> 
> Repository: mesos
> 
> 
> Description
> -------
> 
> Previously, it was possible for join() to return before a schedDriver
> was actually fully stopped or aborted (breaking the semantics of the
> join() call). The race came from a short circuit in join(), which
> simply checked for status != DRIVER_RUNNING before returning. It appears
> this short circuit was introduced to handle cases where initialize() or
> start() ended up aborting before ever starting the driver to begin with.
> However, it unintentionally covers cases where stop() or abort() was
> called *after* the driver started running as well.
> 
> The problem is that stop() and abort() will change the status
> to DRIVER_STOPPED or DRIVER_ABORTED before actually processing
> dispatched stop or abort events (which happen asynchronously in a
> libprocess thread). Under normal operation, join() would wait for these
> events to trigger a latch that allowed the join() call to return.
> However, with the short circuit, join() exits immediately even if the
> libprocess thread hasn't yet processed the stop() or abort() events.
> 
> This commit fixes the semantics of the join() call to avoid this race.
> We considered removing the latch completely and replacing it with
> process.wait(), but, unlike the latch, this wouldn't ensure that stop()
> or abort() was ever called in the first place.
> 
> 
> Diffs
> -----
> 
>   src/sched/sched.cpp 38940b7e2563a2156be2f8c228afdc27c45b6e83 
> 
> Diff: https://reviews.apache.org/r/42519/diff/
> 
> 
> Testing
> -------
> 
> Ran the entire 'make check' suite with no failures on both Mac OS X and
> Ubuntu 14.04.
> 
> 
> Thanks,
> 
> Kevin Klues
> 
>
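
For readers following along, the race in the quoted description can be
sketched roughly as below. This is only an illustration under stated
assumptions, not the code in src/sched/sched.cpp: ToyDriver, Status,
joinRacy(), joinFixed(), handleStopEvent() and the started_ flag are all
hypothetical names, and a std::condition_variable stands in for the latch
and the libprocess thread. The "fixed" variant shows one possible way to
limit the short circuit to a driver that never started; the actual patch
may do this differently.

```cpp
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <thread>

// Hypothetical stand-ins; none of these names come from sched.cpp.
enum class Status { NOT_STARTED, RUNNING, STOPPED, ABORTED };

class ToyDriver {
public:
  void start() {
    std::lock_guard<std::mutex> lock(mutex_);
    started_ = true;
    status_ = Status::RUNNING;
  }

  // As in the description: the status changes immediately, while the
  // stop event itself is handled later on another thread.
  void stop() {
    std::lock_guard<std::mutex> lock(mutex_);
    if (status_ == Status::RUNNING) status_ = Status::STOPPED;
  }

  // Run by the event-processing thread once the stop event is handled.
  void handleStopEvent() {
    {
      std::lock_guard<std::mutex> lock(mutex_);
      latchTriggered_ = true;
    }
    latch_.notify_all();
  }

  // Racy version: short-circuits on any status other than RUNNING.
  Status joinRacy() {
    std::unique_lock<std::mutex> lock(mutex_);
    if (status_ != Status::RUNNING) return status_;
    latch_.wait(lock, [this] { return latchTriggered_; });
    return status_;
  }

  // Possible fix: short-circuit only if the driver never started.
  Status joinFixed() {
    std::unique_lock<std::mutex> lock(mutex_);
    if (!started_) return status_;
    latch_.wait(lock, [this] { return latchTriggered_; });
    return status_;
  }

  bool stopEventHandled() {
    std::lock_guard<std::mutex> lock(mutex_);
    return latchTriggered_;
  }

private:
  std::mutex mutex_;
  std::condition_variable latch_;
  bool started_ = false;
  bool latchTriggered_ = false;
  Status status_ = Status::NOT_STARTED;
};

// Returns true if joinFixed() only returned after the stop event ran.
bool fixedJoinWaitsForStopEvent() {
  ToyDriver driver;
  driver.start();
  driver.stop();

  // The racy join returns although the stop event is still pending.
  assert(driver.joinRacy() == Status::STOPPED);
  assert(!driver.stopEventHandled());

  // Handle the stop event on a separate thread, as libprocess would.
  std::thread worker([&driver] { driver.handleStopEvent(); });
  Status status = driver.joinFixed();
  worker.join();

  return status == Status::STOPPED && driver.stopEventHandled();
}
```

Calling fixedJoinWaitsForStopEvent() exercises both variants on the same
driver. Keeping a latch, rather than only waiting for the worker to exit,
mirrors the point in the description that join() should return only once
stop() or abort() has actually been called and processed.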
