Meng Zhu created MESOS-8459:
-------------------------------

             Summary: Executor could linger without ever receiving any tasks
                 Key: MESOS-8459
                 URL: https://issues.apache.org/jira/browse/MESOS-8459
             Project: Mesos
          Issue Type: Bug
          Components: executor
            Reporter: Meng Zhu


An executor's initial tasks may be killed even after the executor has 
registered. In that case, the executor never receives a task and could linger 
forever.

In MESOS-8411, we have a short-term fix that checks an executor's completed and 
terminated task queues to see if it had ever received any tasks. If the check 
is false and there are no queued or launched tasks, the agent will shut down 
the executor.

However, this check is not bulletproof. The completedTasks queue is a 
circular_buffer (current size 200), which means earlier completed tasks that 
may have been updated by the executor can be evicted and thus missed by this 
check. This would lead to false-positive shutdowns.
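
The eviction problem can be reproduced with a small model. The sketch below is 
not Mesos code: it is a Python toy in which ExecutorModel, Task and 
updated_by_executor are hypothetical names, and it assumes the short-term check 
looks for at least one retained task that carries an executor-generated update.

```python
from collections import deque
from dataclasses import dataclass

COMPLETED_TASKS_CAPACITY = 200  # buffer size quoted in this ticket


@dataclass
class Task:
    task_id: str
    updated_by_executor: bool  # assumption: True if the executor ever updated it


class ExecutorModel:
    """Toy model of the agent's per-executor bookkeeping (hypothetical)."""

    def __init__(self, capacity=COMPLETED_TASKS_CAPACITY):
        self.queued_tasks = {}
        self.launched_tasks = {}
        self.terminated_tasks = {}
        # deque(maxlen=N) drops the oldest entry when full, like a circular buffer.
        self.completed_tasks = deque(maxlen=capacity)

    def ever_received_tasks(self):
        # Inference from whatever the queues still hold.
        retained = list(self.terminated_tasks.values()) + list(self.completed_tasks)
        return any(task.updated_by_executor for task in retained)

    def should_shutdown(self):
        idle = not self.queued_tasks and not self.launched_tasks
        return idle and not self.ever_received_tasks()


def false_positive_demo():
    executor = ExecutorModel()
    # One task really reached the executor and completed.
    executor.completed_tasks.append(Task("delivered", updated_by_executor=True))
    before = executor.should_shutdown()
    # Enough later tasks are killed before delivery to fill the buffer.
    for i in range(COMPLETED_TASKS_CAPACITY):
        executor.completed_tasks.append(Task(f"killed-{i}", updated_by_executor=False))
    after = executor.should_shutdown()
    return before, after
```

In this model the only evidence that the executor ever received a task is 
pushed out of the buffer, so the check flips from "keep" to "shut down" even 
though the executor did receive work.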

Per discussion with [~vinodkone] and [~bmahler], there are two long-term 
solutions.

The first is to checkpoint additional executor state that indicates whether 
the executor has ever received any tasks (no more inference from task queue 
status).

The alternative is to add timeouts in the executor driver to shut down 
lingering executors automatically.
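
Both long-term options can be sketched in a few lines. This is a rough Python 
illustration under stated assumptions, not a proposed Mesos implementation: the 
class names, the JSON checkpoint layout, and the "exit if no task arrives within 
N seconds of registration" policy are all hypothetical.

```python
import json
import os
import time


class CheckpointedExecutorState:
    """Option 1: persist an explicit flag instead of inferring it from queues."""

    def __init__(self, path):
        self.path = path  # hypothetical checkpoint location

    def mark_task_received(self):
        # Write to a temp file, then rename, so a crash cannot leave a
        # half-written checkpoint behind.
        tmp = self.path + ".tmp"
        with open(tmp, "w") as f:
            json.dump({"received_task": True}, f)
        os.replace(tmp, self.path)

    def has_received_task(self):
        try:
            with open(self.path) as f:
                return bool(json.load(f).get("received_task", False))
        except FileNotFoundError:
            return False  # nothing checkpointed yet


class IdleTimeout:
    """Option 2: executor-side timer that fires if no task ever arrives."""

    def __init__(self, timeout_secs, clock=time.monotonic):
        self.timeout_secs = timeout_secs
        self.clock = clock  # injectable so the timer can be tested
        self.registered_at = clock()
        self.received_task = False

    def on_task(self):
        self.received_task = True

    def should_exit(self):
        if self.received_task:
            return False
        return self.clock() - self.registered_at >= self.timeout_secs
```

The flag survives agent restarts and is unaffected by buffer eviction, while 
the timer needs no agent-side state but requires picking a timeout value.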

 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
