[
https://issues.apache.org/jira/browse/MESOS-851?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Benjamin Mahler reassigned MESOS-851:
-------------------------------------
Assignee: Benjamin Mahler
> Scheduler Driver does not guarantee that abort() prevents further calls on
> the Scheduler.
> -----------------------------------------------------------------------------------------
>
> Key: MESOS-851
> URL: https://issues.apache.org/jira/browse/MESOS-851
> Project: Mesos
> Issue Type: Bug
> Components: c++ api, java api, python api
> Reporter: Benjamin Mahler
> Assignee: Benjamin Mahler
> Priority: Critical
> Fix For: 0.16.0
>
>
> This came up while reviewing: https://reviews.apache.org/r/15853/
> Our documentation for abort mentions that no more callbacks can be made to
> the scheduler:
>
> /**
>  * Aborts the driver so that no more callbacks can be made to the
>  * scheduler. The semantics of abort and stop have deliberately been
>  * separated so that code can detect an aborted driver (i.e., via
>  * the return status of SchedulerDriver::join, see below), and
>  * instantiate and start another driver if desired (from within the
>  * same process). Note that 'stop()' is not automatically called
>  * inside 'abort()'.
>  */
> virtual Status abort() = 0;
>
> However, this is inaccurate: abort() only performs a dispatch to the
> SchedulerProcess, which means that any messages already queued will be
> processed before the abort takes effect:
>
> Status MesosSchedulerDriver::abort()
> {
>   Lock lock(&mutex);
>
>   if (status != DRIVER_RUNNING) {
>     return status;
>   }
>
>   CHECK(process != NULL);
>
>   // XXX: This does not immediately signal the SchedulerProcess to stop
>   // processing messages!
>   dispatch(process, &SchedulerProcess::abort);
>
>   return status = DRIVER_ABORTED;
> }
>
> The driver's stop() call has a similar issue in terms of possibly making
> additional calls on the Scheduler after stop() is called.
> This problem is mirrored in the ExecutorDriver's stop and abort functions as
> well.
> So far, I see a few possible fixes:
> 1. Expose the 'volatile bool aborted' member variable of SchedulerProcess and
> set it inside MesosSchedulerDriver::abort. stop() would need a similar
> boolean.
> 2. Provide a "priority dispatch" mechanism in libprocess, wherein the
> DispatchEvent can be sent to the front of the queue. (stop() can also use
> this).
> 3. Terminate the process when abort/stop are called and handle it
> appropriately in the finalize() function. However, this changes the existing
> functionality in that schedulers can no longer make driver calls to kill
> tasks, launch tasks, etc. after being aborted.
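
To make the ordering problem and the first two options concrete, here is a minimal sketch that does not use libprocess at all. ToyProcess, dispatchFront, AbortMode and callbacksAfterAbort are hypothetical names invented for this illustration, and std::atomic<bool> stands in for the 'volatile bool aborted' member mentioned in option 1. It only models callbacks that are already queued at the moment abort is requested; an event that is mid-execution at that moment is outside what this toy covers.

```cpp
#include <atomic>
#include <deque>
#include <functional>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for a process with a FIFO event queue drained by a
// single consumer. This is NOT the libprocess API.
class ToyProcess {
public:
  // Ordinary dispatch: the event waits behind everything already queued.
  void dispatch(std::function<void()> event) {
    queue.push_back(std::move(event));
  }

  // Option 2 (assumed shape): a "priority dispatch" that jumps the queue.
  void dispatchFront(std::function<void()> event) {
    queue.push_front(std::move(event));
  }

  // Drains the queue in order, as the process's own thread would.
  void run() {
    while (!queue.empty()) {
      std::function<void()> event = std::move(queue.front());
      queue.pop_front();
      event();
    }
  }

private:
  std::deque<std::function<void()>> queue;
};

enum class AbortMode {
  QUEUED,       // Current behaviour: abort is dispatched to the back.
  SHARED_FLAG,  // Option 1: the driver sets the flag synchronously.
  PRIORITY      // Option 2: abort is dispatched to the front.
};

// Queues three scheduler callbacks, then requests an abort using 'mode'.
// Returns the names of the callbacks that still reached the scheduler.
std::vector<std::string> callbacksAfterAbort(AbortMode mode) {
  ToyProcess process;
  std::atomic<bool> aborted(false);
  std::vector<std::string> delivered;

  const std::vector<std::string> callbacks = {
      "resourceOffers", "statusUpdate", "frameworkMessage"};

  for (const std::string& name : callbacks) {
    process.dispatch([&aborted, &delivered, name]() {
      if (aborted.load()) {
        return;  // Drop the callback once the process sees the abort.
      }
      delivered.push_back(name);
    });
  }

  auto abortEvent = [&aborted]() { aborted.store(true); };

  switch (mode) {
    case AbortMode::QUEUED:
      process.dispatch(abortEvent);
      break;
    case AbortMode::SHARED_FLAG:
      aborted.store(true);  // Visible before any queued event runs.
      process.dispatch(abortEvent);
      break;
    case AbortMode::PRIORITY:
      process.dispatchFront(abortEvent);
      break;
  }

  process.run();
  return delivered;
}
```

In this toy model, callbacksAfterAbort(AbortMode::QUEUED) still returns all three queued callbacks, while the SHARED_FLAG and PRIORITY modes return none. Whether either approach is safe in the real drivers depends on details of SchedulerProcess and libprocess that this sketch does not capture.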