[ 
https://issues.apache.org/jira/browse/MESOS-1466?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16415612#comment-16415612
 ] 

Benno Evers commented on MESOS-1466:
------------------------------------

If I understand the issue correctly, this race seems to have been eliminated as 
a side effect of introducing the `launch_executor` flag in Mesos 1.5:

When the master sends the `RunTaskMessage` to the agent, it still believes that 
the specified executor is running on the agent, so it sets 
`launch_executor = false`:
{noformat}
// src/master/master.cpp:3841
bool Master::isLaunchExecutor(
    const ExecutorID& executorId,
    Framework* framework,
    Slave* slave) const
{
  CHECK_NOTNULL(framework);
  CHECK_NOTNULL(slave);

  if (!slave->hasExecutor(framework->id(), executorId)) {
    CHECK(!framework->hasExecutor(slave->id, executorId))
      << "Executor '" << executorId
      << "' known to the framework " << *framework
      << " but unknown to the agent " << *slave;

    return true;
  }

  return false;
}
{noformat}
On the agent, if the executor no longer exists, no new executor is launched and 
the task is dropped with reason `REASON_EXECUTOR_TERMINATED`:
{noformat}
// src/slave/slave.cpp:2881

        // Master does not want to launch executor.
        if (executor == nullptr) {
          // Master wants no new executor launched and there is none running on
          // the agent. This could happen if the task expects some previous
          // tasks to launch the executor. However, the earlier task got killed
          // or dropped hence did not launch the executor but the master doesn't
          // know about it yet because the `ExitedExecutorMessage` is still in
          // flight. In this case, we will drop the task.
          //
          // We report TASK_DROPPED to the framework because the task was
          // never launched. For non-partition-aware frameworks, we report
          // TASK_LOST for backward compatibility.
          mesos::TaskState taskState = TASK_DROPPED;
          if (!protobuf::frameworkHasCapability(
              frameworkInfo, FrameworkInfo::Capability::PARTITION_AWARE)) {
            taskState = TASK_LOST;
          }

          foreach (const TaskInfo& _task, tasks) {
            const StatusUpdate update = protobuf::createStatusUpdate(
                frameworkId,
                info.id(),
                _task.task_id(),
                taskState,
                TaskStatus::SOURCE_SLAVE,
                id::UUID::random(),
                "No executor is expected to launch and there is none running",
                TaskStatus::REASON_EXECUTOR_TERMINATED,
                executorId);

            statusUpdate(update, UPID());
          }

          // We do not send `ExitedExecutorMessage` here because the expectation
          // is that there is already one on the fly to master. If the message
          // gets dropped, we will hopefully reconcile with the master later.

          return;
        }
{noformat}
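
To make the interaction between the two snippets concrete, here is a rough, self-contained sketch of the flow. It is not Mesos code: `MasterView`, `AgentState`, `TaskOutcome` and `replayRace` are hypothetical names made up for illustration, and the model assumes that the only relevant state is the set of executor IDs each side knows about.

```cpp
#include <cassert>
#include <set>
#include <string>

// Hypothetical, simplified stand-ins for the Mesos master and agent.
// None of these types or functions exist in Mesos under these names.

enum class TaskOutcome {
  RanOnExistingExecutor,
  RanOnNewExecutor,
  Dropped
};

struct MasterView {
  // Executors the master currently believes are running on the agent.
  std::set<std::string> executors;

  // Same decision as the quoted Master::isLaunchExecutor, minus the CHECKs:
  // ask for a new executor only if the master knows of none with this ID.
  bool isLaunchExecutor(const std::string& executorId) const {
    return executors.count(executorId) == 0;
  }
};

struct AgentState {
  // Executors that are actually running on the agent.
  std::set<std::string> executors;

  TaskOutcome runTask(const std::string& executorId, bool launchExecutor) {
    if (executors.count(executorId) > 0) {
      // Simplification (assumption): the sketch does not model the case
      // where the master asks for a new executor while one is running.
      return TaskOutcome::RanOnExistingExecutor;
    }

    if (launchExecutor) {
      executors.insert(executorId);
      return TaskOutcome::RanOnNewExecutor;
    }

    // launch_executor == false and no executor is running: drop the task
    // instead of silently starting an executor the master did not plan for.
    return TaskOutcome::Dropped;
  }
};

// Replays the race: the master's view is stale because the exited-executor
// message has not been processed yet.
inline TaskOutcome replayRace() {
  MasterView master;
  master.executors.insert("executor-1");  // Stale entry on the master.

  AgentState agent;  // The executor has already exited on the agent.

  const bool launchExecutor = master.isLaunchExecutor("executor-1");
  return agent.runTask("executor-1", launchExecutor);
}
```

In this model `replayRace()` ends in `TaskOutcome::Dropped`, so the agent never commits resources for an executor that the master has not accounted for.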

> Race between executor exited event and launch task can cause overcommit of 
> resources
> ------------------------------------------------------------------------------------
>
>                 Key: MESOS-1466
>                 URL: https://issues.apache.org/jira/browse/MESOS-1466
>             Project: Mesos
>          Issue Type: Bug
>          Components: allocation, master
>            Reporter: Vinod Kone
>            Priority: Major
>              Labels: reliability, twitter
>
> The following sequence of events can cause an overcommit
> --> Launch task is called for a task whose executor is already running
> --> Executor's resources are not accounted for on the master
> --> Executor exits and the event is enqueued behind launch tasks on the master
> --> Master sends the task to the slave, which needs to commit resources 
> for the task and the (new) executor.
> --> Master processes the executor exited event and re-offers the executor's 
> resources causing an overcommit of resources.
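
The accounting behind the sequence above can be sketched with a few integers. This is only an illustration under assumed numbers: `replayOvercommit`, `RaceResult` and the CPU counts are hypothetical, and the model tracks a single resource on a single agent as it behaved before the `launch_executor` flag existed.

```cpp
#include <cassert>

// Hypothetical bookkeeping for one agent; not Mesos code.
struct RaceResult {
  int masterUsedCpus;  // What the master believes is in use.
  int agentUsedCpus;   // What the agent has actually committed.
};

inline RaceResult replayOvercommit(int executorCpus, int taskCpus) {
  // Starting point: the executor is running and both sides agree.
  int masterUsed = executorCpus;
  int agentUsed = executorCpus;

  // The master launches a task on the (supposedly) running executor and
  // charges only the task, not the executor.
  masterUsed += taskCpus;

  // The executor exits on the agent; the exit event is still queued behind
  // the launch on the master.
  agentUsed -= executorCpus;

  // The agent receives the task and, finding no executor, commits resources
  // for the task and a new executor.
  agentUsed += taskCpus + executorCpus;

  // The master processes the exit event and recovers the executor's
  // resources so that they can be offered again.
  masterUsed -= executorCpus;

  return RaceResult{masterUsed, agentUsed};
}
```

For example, with an executor of 1 CPU and a task of 2 CPUs, the master ends up believing 2 CPUs are in use while the agent has committed 3. The difference is exactly the new executor's resources, which the master is free to offer again.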



