Chatted offline with Chun and Meng and suggested we take an explicit
approach of using process::Sequence to ensure ordered task delivery (this
would need to be done both in the master and agent).

On Thu, Mar 1, 2018 at 1:17 PM, Chun-Hung Hsiao <chhs...@mesosphere.io>
wrote:

> Some background for the bug AlexR and Meng found:
>
> In https://issues.apache.org/jira/browse/MESOS-1720,
> we introduce an ordering dependency between two `LAUNCH`/`LAUNCH_GROUP`
> calls to a new executor.
> The master would specify that the first call is the one to launch a new
> executor
> through the `launch_executor` field in
> `RunTaskMessage`/`RunTaskGroupMessage`,
> and the second one should use the existing executor launched by the first
> call.
> At the agent side, it will drop any task that want to launch an executor
> which is already existing,
> or any task that want to run on a non-existent executor.
>
> Running a task/task group goes through a series of continuations,
> one is `collect()` on the future that unschedule frameworks from being
> GC'ed:
> https://github.com/apache/mesos/blob/master/src/slave/slave.cpp#L2158
> another is `collect()` on task authorization:
> https://github.com/apache/mesos/blob/master/src/slave/slave.cpp#L2333
> Since these `collect()` calls run on individual actors, the futures of the
> `collect()` calls for
> two `LAUNCH`/`LAUNCH_GROUP` calls may returns out-of-order,
> even if the futures these two `collect()` wait for are satisfied in order
> (which is true).
>
> The result is that, if this race condition is triggered,
> the agent will try to run the second task/task group before the first one,
> and since the executor is supposed to be launched by the first one,
> the agent will end up sending `TASK_DROPPED` for the second call.
>
> If we can have an interface to make sure that `collect()` returns in the
> same order
> of their dependent futures, this can be avoided.
>
> On Mar 1, 2018 12:50 PM, "Benjamin Mahler" <bmah...@apache.org> wrote:
>
> > Could you explain the problem in more detail?
> >
> > On Thu, Mar 1, 2018 at 12:15 PM Chun-Hung Hsiao <chhs...@mesosphere.io>
> > wrote:
> >
> > > Hi all,
> > >
> > > Meng found a bug in `slave.cpp`, where the proper fix requires
> collecting
> > > futures in order. Currently every `collect` call spawns it's own actor,
> > so
> > > for two `collect` calls, even though their futures are satisfied in
> > order,
> > > they may finish out-of-order. So we need some libprocess changes to
> have
> > > the ability to collect futures in the same actor. Here I have two
> > > proposals:
> > >
> > > 1. Add a new `collect` interface that takes an actor as a parameter.
> > >
> > > 2. Introduce `process::Executor::collect()` for this.
> > >
> > > Any opinion on these two options?
> > >
> >
>

Reply via email to