Meng,

Could you double check if this is really an issue in Mesos 1.5.0 release?

MESOS-1720 <https://issues.apache.org/jira/browse/MESOS-1720> was resolved
after the 1.5 release (rc-2) and it seems like
it is only at the master branch and 1.5.x branch (not 1.5.0).

Did I miss anything?

- Gilbert

On Thu, Mar 1, 2018 at 4:22 PM, Benjamin Mahler <bmah...@apache.org> wrote:

> Put another way, we currently don't guarantee in-order task delivery to
> the executor. Due to the changes for MESOS-1720, one special case of task
> re-ordering now leads to the re-ordered task being dropped (rather than
> delivered out-of-order as before). Technically, this is strictly better.
>
> However, we'd like to start guaranteeing in-order task delivery.
>
> On Thu, Mar 1, 2018 at 2:56 PM, Meng Zhu <m...@mesosphere.com> wrote:
>
>> Hi all:
>>
>> TLDR: In Mesos 1.5, tasks may be explicitly dropped by the agent
>> if all three conditions are met:
>> (1) Several `LAUNCH_TASK` or `LAUNCH_GROUP` calls
>>  use the same executor.
>> (2) The executor currently does not exist on the agent.
>> (3) Due to some race conditions, these tasks are trying to launch
>> on the agent in a different order from their original launch order.
>>
>> In this case, tasks that are trying to launch on the agent
>> before the first task in the original order will be explicitly dropped by
>> the agent (TASK_DROPPED` or `TASK_LOST` will be sent)).
>>
>> This bug will be fixed in 1.5.1. It is tracked in
>> https://issues.apache.org/jira/browse/MESOS-8624
>>
>> ----
>>
>> In https://issues.apache.org/jira/browse/MESOS-1720, we introduced an
>> ordering dependency between two `LAUNCH`/`LAUNCH_GROUP`
>> calls to a new executor. The master would specify that the first call is
>> the
>> one to launch a new executor through the `launch_executor` field in
>> `RunTaskMessage`/`RunTaskGroupMessage`, and the second one should
>> use the existing executor launched by the first one.
>>
>> On the agent side, running a task/task group goes through a series of
>> continuations, one is `collect()` on the future that unschedule
>> frameworks from
>> being GC'ed:
>> https://github.com/apache/mesos/blob/master/src/slave/slave.cpp#L2158
>> another is `collect()` on task authorization:
>> https://github.com/apache/mesos/blob/master/src/slave/slave.cpp#L2333
>> Since these `collect()` calls run on individual actors, the futures of the
>> `collect()` calls for two `LAUNCH`/`LAUNCH_GROUP` calls may return
>> out-of-order, even if the futures these two `collect()` wait for are
>> satisfied in
>> order (which is true in these two cases).
>>
>> As a result, under some race conditions (probably under some heavy load
>> conditions), tasks rely on the previous task to launch executor may
>> get processed before the task that is supposed to launch the executor
>> first, resulting in the tasks being explicitly dropped by the agent.
>>
>> -Meng
>>
>>
>>
>

Reply via email to