Benjamin Mahler created MESOS-851:
-------------------------------------

             Summary: Scheduler Driver does not guarantee that abort() prevents 
further calls on the Scheduler.
                 Key: MESOS-851
                 URL: https://issues.apache.org/jira/browse/MESOS-851
             Project: Mesos
          Issue Type: Bug
          Components: c++ api, java api, python api
            Reporter: Benjamin Mahler
            Priority: Critical
             Fix For: 0.16.0


This came up while reviewing: https://reviews.apache.org/r/15853/

Our documentation for abort mentions that no more callbacks can be made to the 
scheduler:
  /**
   * Aborts the driver so that no more callbacks can be made to the
   * scheduler. The semantics of abort and stop have deliberately been
   * separated so that code can detect an aborted driver (i.e., via
   * the return status of SchedulerDriver::join, see below), and
   * instantiate and start another driver if desired (from within the
   * same process). Note that 'stop()' is not automatically called
   * inside 'abort()'.
   */
  virtual Status abort() = 0;

However, this is inaccurate: abort() only dispatches to the SchedulerProcess, 
so any messages already queued are processed before the abort takes effect:


Status MesosSchedulerDriver::abort()
{
  Lock lock(&mutex);

  if (status != DRIVER_RUNNING) {
    return status;
  }

  CHECK(process != NULL);

  // XXX: This does not immediately signal the SchedulerProcess to stop
  // processing messages!
  dispatch(process, &SchedulerProcess::abort);

  return status = DRIVER_ABORTED;
}
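
To make the ordering problem concrete, here is a minimal single-threaded sketch. It is not libprocess code: ToyProcess, enqueueMessage, dispatchAbort and drain are hypothetical names that only mimic a FIFO event queue. It shows that messages queued before the abort dispatch still reach the scheduler after the driver's abort() has returned.

```cpp
#include <cassert>
#include <deque>
#include <functional>
#include <string>
#include <utility>
#include <vector>

// Toy stand-in for an actor with a FIFO event queue. This is NOT the
// libprocess API; all names here are assumptions made for illustration.
struct ToyProcess {
  std::deque<std::function<void()>> queue;  // pending events, FIFO
  bool aborted = false;                     // set only when the abort event runs
  std::vector<std::string> callbacks;       // callbacks delivered to the scheduler

  // Queue a message that, when processed, invokes a scheduler callback.
  void enqueueMessage(const std::string& name) {
    queue.push_back([this, name]() {
      if (!aborted) {
        callbacks.push_back(name);
      }
    });
  }

  // Analogue of dispatch(process, &SchedulerProcess::abort): the abort is
  // appended behind everything that is already queued.
  void dispatchAbort() {
    queue.push_back([this]() { aborted = true; });
  }

  // Process every queued event in arrival order.
  void drain() {
    while (!queue.empty()) {
      std::function<void()> event = std::move(queue.front());
      queue.pop_front();
      event();
    }
  }
};

// Returns the callbacks that were delivered even though the driver-level
// abort() had already returned to the caller.
std::vector<std::string> callbacksAfterAbort() {
  ToyProcess process;
  process.enqueueMessage("resourceOffers");
  process.enqueueMessage("statusUpdate");
  process.dispatchAbort();                     // driver's abort() returns here
  process.enqueueMessage("frameworkMessage");  // arrives after the abort event
  process.drain();
  assert(process.aborted);
  return process.callbacks;
}
```

In this sketch only the message queued behind the abort event is suppressed; the two messages queued ahead of it are still delivered.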

The driver's stop() has a similar issue: additional calls may be made on the 
Scheduler after stop() has been called.

The ExecutorDriver's stop() and abort() functions have the same problem.

So far, I see a few possible fixes:

1. Expose the 'volatile bool aborted' member variable of SchedulerProcess and 
set it inside MesosSchedulerDriver::abort. stop() would need a similar boolean.

2. Provide a "priority dispatch" mechanism in libprocess, wherein the 
DispatchEvent can be sent to the front of the queue. (stop() can also use this).

3. Terminate the process when abort/stop are called and handle it appropriately 
in the finalize() function. However, this changes the existing functionality: 
schedulers could no longer make driver calls (to kill tasks, launch tasks, 
etc.) after being aborted.
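
As a rough sketch of how options 1 and 2 would behave, the toy model below uses a plain FIFO queue rather than libprocess. All names (ToyProcess, abortViaFlag, abortViaPriorityDispatch, Fix) are hypothetical, and std::atomic<bool> is swapped in for the 'volatile bool aborted' member on the assumption that C++11 is available. It is only meant to show the intended ordering, not a proposed patch.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <deque>
#include <functional>
#include <string>
#include <utility>
#include <vector>

// Toy FIFO actor; NOT the libprocess API. Names are illustrative assumptions.
struct ToyProcess {
  std::deque<std::function<void()>> queue;  // pending events
  std::atomic<bool> aborted{false};         // checked before every callback
  std::vector<std::string> callbacks;       // callbacks delivered to the scheduler

  // Queue a message that invokes a scheduler callback unless aborted.
  void enqueueMessage(const std::string& name) {
    queue.push_back([this, name]() {
      if (!aborted.load()) {
        callbacks.push_back(name);
      }
    });
  }

  // Option 1: the driver sets the flag directly, bypassing the queue.
  void abortViaFlag() { aborted.store(true); }

  // Option 2: a hypothetical "priority dispatch" that places the abort
  // event at the front of the queue instead of the back.
  void abortViaPriorityDispatch() {
    queue.push_front([this]() { aborted.store(true); });
  }

  // Process every queued event in queue order.
  void drain() {
    while (!queue.empty()) {
      std::function<void()> event = std::move(queue.front());
      queue.pop_front();
      event();
    }
  }
};

enum class Fix { SharedFlag, PriorityDispatch };

// Counts the callbacks delivered for messages that were already queued
// when abort was requested.
std::size_t callbacksDeliveredAfterAbort(Fix fix) {
  ToyProcess process;
  process.enqueueMessage("resourceOffers");
  process.enqueueMessage("statusUpdate");
  if (fix == Fix::SharedFlag) {
    process.abortViaFlag();
  } else {
    process.abortViaPriorityDispatch();
  }
  process.drain();
  assert(process.aborted.load());
  return process.callbacks.size();
}
```

In this model both approaches suppress the messages that were already queued. Neither can interrupt a callback that is already executing when abort is requested, so a real fix would presumably still have to define what happens to an in-flight callback.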


