Hi all:

TLDR: In Mesos 1.5, tasks may be explicitly dropped by the agent
if all three conditions are met:
(1) Several `LAUNCH_TASK` or `LAUNCH_GROUP` calls
 use the same executor.
(2) The executor currently does not exist on the agent.
(3) Due to some race conditions, these tasks are trying to launch
on the agent in a different order from their original launch order.

In this case, tasks that are trying to launch on the agent
before the first task in the original order will be explicitly dropped by
the agent (TASK_DROPPED` or `TASK_LOST` will be sent)).

This bug will be fixed in 1.5.1. It is tracked in
https://issues.apache.org/jira/browse/MESOS-8624

----

In https://issues.apache.org/jira/browse/MESOS-1720, we introduced an
ordering dependency between two `LAUNCH`/`LAUNCH_GROUP`
calls to a new executor. The master would specify that the first call is the
one to launch a new executor through the `launch_executor` field in
`RunTaskMessage`/`RunTaskGroupMessage`, and the second one should
use the existing executor launched by the first one.

On the agent side, running a task/task group goes through a series of
continuations, one is `collect()` on the future that unschedule frameworks
from
being GC'ed:
https://github.com/apache/mesos/blob/master/src/slave/slave.cpp#L2158
another is `collect()` on task authorization:
https://github.com/apache/mesos/blob/master/src/slave/slave.cpp#L2333
Since these `collect()` calls run on individual actors, the futures of the
`collect()` calls for two `LAUNCH`/`LAUNCH_GROUP` calls may return
out-of-order, even if the futures these two `collect()` wait for are
satisfied in
order (which is true in these two cases).

As a result, under some race conditions (probably under some heavy load
conditions), tasks rely on the previous task to launch executor may
get processed before the task that is supposed to launch the executor
first, resulting in the tasks being explicitly dropped by the agent.

-Meng

Reply via email to