[ 
https://issues.apache.org/jira/browse/MESOS-851?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Mahler reassigned MESOS-851:
-------------------------------------

    Assignee: Benjamin Mahler

> Scheduler Driver does not guarantee that abort() prevents further calls on 
> the Scheduler.
> -----------------------------------------------------------------------------------------
>
>                 Key: MESOS-851
>                 URL: https://issues.apache.org/jira/browse/MESOS-851
>             Project: Mesos
>          Issue Type: Bug
>          Components: c++ api, java api, python api
>            Reporter: Benjamin Mahler
>            Assignee: Benjamin Mahler
>            Priority: Critical
>             Fix For: 0.16.0
>
>
> This came up while reviewing: https://reviews.apache.org/r/15853/
> Our documentation for abort() states that no more callbacks can be made to 
> the scheduler:
>   /**
>    * Aborts the driver so that no more callbacks can be made to the
>    * scheduler. The semantics of abort and stop have deliberately been
>    * separated so that code can detect an aborted driver (i.e., via
>    * the return status of SchedulerDriver::join, see below), and
>    * instantiate and start another driver if desired (from within the
>    * same process). Note that 'stop()' is not automatically called
>    * inside 'abort()'.
>    */
>   virtual Status abort() = 0;
> However, this is inaccurate: abort() only performs a dispatch to the 
> SchedulerProcess, so any messages already queued are processed (and may 
> still invoke the Scheduler) before the abort takes effect:
> Status MesosSchedulerDriver::abort()
> {
>   Lock lock(&mutex);
>   if (status != DRIVER_RUNNING) {
>     return status;
>   }
>   CHECK(process != NULL);
>   // XXX: This does not immediately signal the SchedulerProcess to stop
>   // processing messages!
>   dispatch(process, &SchedulerProcess::abort);
>   return status = DRIVER_ABORTED;
> }
> The driver's stop() call has a similar issue: the driver may make 
> additional calls on the Scheduler after stop() has been called.
> The ExecutorDriver's stop() and abort() functions have the same problem.
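
To make the ordering problem concrete, here is a minimal sketch that models the process as a plain FIFO queue. ToyProcess, message() and fifoAbortDemo() are hypothetical names made up for this illustration; this is not the libprocess API, only an assumption about how an in-order event queue behaves.

```cpp
#include <cassert>
#include <deque>
#include <functional>
#include <string>
#include <utility>
#include <vector>

// Toy stand-in for a process with an event queue. NOT the libprocess API.
struct ToyProcess {
  std::deque<std::function<void()>> queue;  // events, handled in FIFO order
  bool aborted = false;
  std::vector<std::string> callbacks;       // callbacks delivered to the "scheduler"

  // Enqueue an event at the back of the queue.
  void dispatch(std::function<void()> f) { queue.push_back(std::move(f)); }

  // Enqueue a message that invokes a scheduler callback unless aborted.
  void message(const std::string& name) {
    dispatch([this, name]() {
      if (!aborted) callbacks.push_back(name);
    });
  }

  // Drain the queue in order.
  void run() {
    while (!queue.empty()) {
      auto f = std::move(queue.front());
      queue.pop_front();
      f();
    }
  }
};

// Two messages are already queued when abort is dispatched behind them;
// a third message arrives after the abort.
inline std::vector<std::string> fifoAbortDemo() {
  ToyProcess p;
  p.message("resourceOffers");
  p.message("statusUpdate");
  p.dispatch([&p]() { p.aborted = true; });  // models the dispatch in abort()
  p.message("frameworkMessage");
  p.run();
  return p.callbacks;
}
```

In this model the two messages queued ahead of the abort still reach the scheduler, and only the one queued after it is suppressed.
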
> So far, I see a few possible fixes:
> 1. Expose the 'volatile bool aborted' member variable of SchedulerProcess and 
> set it inside MesosSchedulerDriver::abort. stop() would need a similar 
> boolean.
> 2. Provide a "priority dispatch" mechanism in libprocess, wherein the 
> DispatchEvent can be sent to the front of the queue. (stop() can also use 
> this).
> 3. Terminate the process when abort/stop are called and handle it 
> appropriately in the finalize() function. However, this changes the existing 
> behavior: schedulers can no longer make driver calls (to kill tasks, launch 
> tasks, etc.) after being aborted.



--
This message was sent by Atlassian JIRA
(v6.1#6144)
