> On Jan. 22, 2016, 6:58 a.m., Ben Mahler wrote: > > Logically looks good, just a couple of trivial comments and we can get this > > landed!
I was going to start fixing things up, but it looks like you took care of it for me. Thanks! - Kevin ----------------------------------------------------------- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/42519/#review115796 ----------------------------------------------------------- On Jan. 19, 2016, 10:58 p.m., Kevin Klues wrote: > > ----------------------------------------------------------- > This is an automatically generated e-mail. To reply, visit: > https://reviews.apache.org/r/42519/ > ----------------------------------------------------------- > > (Updated Jan. 19, 2016, 10:58 p.m.) > > > Review request for mesos, Ben Mahler and Greg Mann. > > > Bugs: MESOS-4409 > https://issues.apache.org/jira/browse/MESOS-4409 > > > Repository: mesos > > > Description > ------- > > Previously, it was possible for join() to return before a schedDriver > was actually fully stopped or aborted (breaking the semantics of the > join() call). The race came from a short circuit in join(), which > simply checked for status != DRIVER_RUNNING before returning. It appears > this short circuit was introduced to handle cases where initialize() or > start() ended up aborting before ever starting the driver to begin with. > However, it unintentionally covers cases where stop() or abort() were > called *after* the driver started running as well. > > The problem is that stop() and abort() will change the status > to DRIVER_STOPPED or DRIVER_ABORTED before actually processing > dispatched stop or abort events (which happen asynchronously in a > libprocess thread). Under normal operation, join() would wait for these > events to trigger a latch that allowed the join() call to return. > However, with the short circuit, join() exits immediately even if the > libprocess thread hasn't yet processed the stop() or abort() events. > > This commit fixes the semantics of the join() call to avoid this race. > We considered removing the latch completely and replacing it with > process.wait(), but, unlike the latch, this wouldn't ensure that stop() > or abort() was ever called in the first place. > > > Diffs > ----- > > src/sched/sched.cpp 38940b7e2563a2156be2f8c228afdc27c45b6e83 > > Diff: https://reviews.apache.org/r/42519/diff/ > > > Testing > ------- > > Ran the entire 'make check' suite with no failures on both Mac OS X and > ubuntu 14.04. > > > Thanks, > > Kevin Klues > >
