Yan Xu created MESOS-1646:
-----------------------------

             Summary: TASK_LOST due to terminated executor waiting for status update acknowledgment
                 Key: MESOS-1646
                 URL: https://issues.apache.org/jira/browse/MESOS-1646
             Project: Mesos
          Issue Type: Bug
            Reporter: Yan Xu


There are executor implementations such as 
[ThermosGCExecutor|https://github.com/apache/incubator-aurora/blob/c97ab632750e6e4abc685c9bfd3eea11354dd1e7/src/main/python/apache/aurora/executor/gc_executor.py#L457] 
that commit seppuku when some *local* criteria are met. After the slave notices 
that the executor has terminated, it transitions the executor into a TERMINATED 
state but keeps it around until all the status updates for this executor have 
been acknowledged by the scheduler.

Between the time the slave learns of the executor's termination and the time 
all status updates are acknowledged, the slave can receive more tasks for this 
executor (already terminated, but unbeknownst to the scheduler), in which case 
the slave reports these tasks as TASK_LOST.
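
To make the window concrete, here is a minimal toy model of the bookkeeping described above. It is a sketch, not Mesos code: the {{Slave}} and {{Executor}} classes, their method names, and the use of TASK_STAGING as the "accepted" result are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from enum import Enum


class ExecutorState(Enum):
    RUNNING = "RUNNING"
    TERMINATED = "TERMINATED"


@dataclass
class Executor:
    executor_id: str
    state: ExecutorState = ExecutorState.RUNNING
    # Status updates that were sent but not yet acknowledged by the scheduler.
    pending_updates: set = field(default_factory=set)


class Slave:
    """Hypothetical, simplified model of the slave-side behavior in this report."""

    def __init__(self):
        self.executors = {}

    def add_executor(self, executor_id, pending_updates=()):
        self.executors[executor_id] = Executor(
            executor_id, pending_updates=set(pending_updates)
        )

    def executor_exited(self, executor_id):
        # The executor is marked TERMINATED but kept around while
        # status updates are still unacknowledged.
        executor = self.executors[executor_id]
        executor.state = ExecutorState.TERMINATED
        self._maybe_remove(executor)

    def acknowledge(self, executor_id, update_id):
        executor = self.executors[executor_id]
        executor.pending_updates.discard(update_id)
        self._maybe_remove(executor)

    def launch_task(self, executor_id, task_id):
        executor = self.executors.get(executor_id)
        if executor is not None and executor.state is ExecutorState.TERMINATED:
            # The task targets an executor that is already gone: reject it.
            return "TASK_LOST"
        if executor is None:
            # Simplification: an unknown executor ID starts a fresh executor.
            self.add_executor(executor_id)
        return "TASK_STAGING"

    def _maybe_remove(self, executor):
        # Forget the executor only once it has terminated AND every
        # status update has been acknowledged.
        if executor.state is ExecutorState.TERMINATED and not executor.pending_updates:
            del self.executors[executor.executor_id]
```

In this model, a task that arrives after {{executor_exited}} but before the last {{acknowledge}} comes back as TASK_LOST, while the same executor ID is accepted again once the terminated executor has been removed. The longer acknowledgments are delayed, the wider that window becomes.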

This feels like semantically correct behavior, but when master load peaks the 
delay for the status updates can be long, and the master can suddenly see many 
TASK_LOSTs from different slaves simultaneously after it sends these "stale" 
tasks to terminated executors (because of the delay).

I think we want to mitigate the sudden surge of lost tasks, which can be 
alarming and which we can't differentiate from other, more serious situations 
(*maybe by introducing another {{TaskState}}*), but it still sounds correct to 
me that the slave should reject tasks that target a terminated executorID. 
Maybe there should be better semantics that *require the scheduler to initiate 
the executor's seppuku*, so that it won't send tasks that reach the slave only 
after the executor's lifespan has ended?



--
This message was sent by Atlassian JIRA
(v6.2#6252)
