----------------------------------------------------------- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/42519/#review115796 -----------------------------------------------------------
Logically looks good, just a couple of trivial comments and we can get this landed! src/sched/sched.cpp (lines 1912 - 1913) <https://reviews.apache.org/r/42519/#comment176853> How about a return of the status inside this 'if' block, and no 'else' block? That would be a little more consistent with our coding style (and the original structure of join() before this patch). src/sched/sched.cpp (line 1918) <https://reviews.apache.org/r/42519/#comment176852> We tend to use newlines between logical blocks of code, so there should be one after the closing brace here. src/sched/sched.cpp (lines 1919 - 1922) <https://reviews.apache.org/r/42519/#comment176851> Ok, so we don't need to, but how about we lock it just to keep all access to `status` synchronized? This seems pretty subtle to explain? - Ben Mahler On Jan. 19, 2016, 10:58 p.m., Kevin Klues wrote: > > ----------------------------------------------------------- > This is an automatically generated e-mail. To reply, visit: > https://reviews.apache.org/r/42519/ > ----------------------------------------------------------- > > (Updated Jan. 19, 2016, 10:58 p.m.) > > > Review request for mesos, Ben Mahler and Greg Mann. > > > Bugs: MESOS-4409 > https://issues.apache.org/jira/browse/MESOS-4409 > > > Repository: mesos > > > Description > ------- > > Previously, it was possible for join() to return before a schedDriver > was actually fully stopped or aborted (breaking the semantics of the > join() call). The race came from a short circuit in join(), which > simply checked for status != DRIVER_RUNNING before returning. It appears > this short circuit was introduced to handle cases where initialize() or > start() ended up aborting before ever starting the driver to begin with. > However, it unintentionally covers cases where stop() or abort() were > called *after* the driver started running as well. > > The problem is that stop() and abort() will change the status > to DRIVER_STOPPED or DRIVER_ABORTED before actually processing > dispatched stop or abort events (which happen asynchronously in a > libprocess thread). Under normal operation, join() would wait for these > events to trigger a latch that allowed the join() call to return. > However, with the short circuit, join() exits immediately even if the > libprocess thread hasn't yet processed the stop() or abort() events. > > This commit fixes the semantics of the join() call to avoid this race. > We considered removing the latch completely and replacing it with > process.wait(), but, unlike the latch, this wouldn't ensure that stop() > or abort() was ever called in the first place. > > > Diffs > ----- > > src/sched/sched.cpp 38940b7e2563a2156be2f8c228afdc27c45b6e83 > > Diff: https://reviews.apache.org/r/42519/diff/ > > > Testing > ------- > > Ran the entire 'make check' suite with no failures on both Mac OS X and > ubuntu 14.04. > > > Thanks, > > Kevin Klues > >
