Repository: mesos
Updated Branches:
  refs/heads/master c53de123a -> 6195e783d


Fixed race between {stop(), abort()} and join() in the scheduler driver.

Previously, it was possible for join() to return before a schedDriver
was actually fully stopped or aborted (breaking the semantics of the
join() call). The race came from a short circuit in join(), which
simply checked for status != DRIVER_RUNNING before returning. It appears
this short circuit was introduced to handle cases where initialize() or
start() ended up aborting before ever starting the driver to begin with.
However, it unintentionally covers cases where stop() or abort() were
called *after* the driver started running as well.

The problem is that stop() and abort() will change the status
to DRIVER_STOPPED or DRIVER_ABORTED before actually processing
dispatched stop or abort events (which happen asynchronously in a
libprocess thread). Under normal operation, join() would wait for these
events to trigger a latch that allowed the join() call to return.
However, with the short circuit, join() exits immediately even if the
libprocess thread hasn't yet processed the stop() or abort() events.

This commit fixes the semantics of the join() call to avoid this race.
We considered removing the latch completely and replacing it with
process.wait(), but, unlike the latch, this wouldn't ensure that stop()
or abort() was ever called in the first place.

Review: https://reviews.apache.org/r/42519/


Project: http://git-wip-us.apache.org/repos/asf/mesos/repo
Commit: http://git-wip-us.apache.org/repos/asf/mesos/commit/f5f76701
Tree: http://git-wip-us.apache.org/repos/asf/mesos/tree/f5f76701
Diff: http://git-wip-us.apache.org/repos/asf/mesos/diff/f5f76701

Branch: refs/heads/master
Commit: f5f76701974482bbeacff4a347390287a788cf9d
Parents: c53de12
Author: Kevin Klues <[email protected]>
Authored: Thu Jan 21 23:12:00 2016 -0800
Committer: Benjamin Mahler <[email protected]>
Committed: Thu Jan 21 23:24:01 2016 -0800

----------------------------------------------------------------------
 src/sched/sched.cpp | 13 +++++++------
 1 file changed, 7 insertions(+), 6 deletions(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/mesos/blob/f5f76701/src/sched/sched.cpp
----------------------------------------------------------------------
diff --git a/src/sched/sched.cpp b/src/sched/sched.cpp
index 38940b7..67038ef 100644
--- a/src/sched/sched.cpp
+++ b/src/sched/sched.cpp
@@ -1905,19 +1905,20 @@ Status MesosSchedulerDriver::abort()
 
 Status MesosSchedulerDriver::join()
 {
-  // Exit early if the driver is not running.
+  // We can use the process pointer to detect if the driver was ever
+  // started properly. If it wasn't, we return the current status
+  // (which should either be DRIVER_NOT_STARTED or DRIVER_ABORTED).
   synchronized (mutex) {
-    if (status != DRIVER_RUNNING) {
+    if (process == NULL) {
+      CHECK(status == DRIVER_NOT_STARTED || status == DRIVER_ABORTED);
+
       return status;
     }
   }
 
-  // If the driver was running, the latch will be triggered regardless
-  // of the current `status`. Wait for this to happen to signify
-  // termination.
+  // Otherwise, wait for stop() or abort() to trigger the latch.
   CHECK_NOTNULL(latch)->await();
 
-  // Now return the current `status` of the driver.
   synchronized (mutex) {
     CHECK(status == DRIVER_ABORTED || status == DRIVER_STOPPED);
 

Reply via email to