Yan Xu created MESOS-1646:
-----------------------------
Summary: TASK_LOST due to terminated executor waiting for status update acknowledgment
Key: MESOS-1646
URL: https://issues.apache.org/jira/browse/MESOS-1646
Project: Mesos
Issue Type: Bug
Reporter: Yan Xu
There are executor implementations such as
[ThermosGCExecutor|https://github.com/apache/incubator-aurora/blob/c97ab632750e6e4abc685c9bfd3eea11354dd1e7/src/main/python/apache/aurora/executor/gc_executor.py#L457]
that commit seppuku when some *local* criteria are met. After the slave detects
that the executor has terminated, it transitions the executor into a TERMINATED
state but keeps it around until all the status updates for this executor have
been acknowledged by the scheduler.
Between the time the slave learns of the executor's termination and the time
all status updates are acknowledged, the slave can receive more tasks for this
executor (already terminated, but unbeknownst to the scheduler), in which case
the slave transitions these tasks to TASK_LOST.
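
To make the window concrete, here is a minimal toy model of the bookkeeping described above. It is a sketch, not Mesos code: {{ToySlave}}, {{Executor}}, and every method name are hypothetical, and the "TASK_STAGING" result for a healthy or newly started executor is an assumption made only for illustration.

```python
from dataclasses import dataclass, field
from enum import Enum


class ExecutorState(Enum):
    RUNNING = "RUNNING"
    TERMINATED = "TERMINATED"


@dataclass
class Executor:
    executor_id: str
    state: ExecutorState = ExecutorState.RUNNING
    # IDs of status updates sent to the scheduler but not yet acknowledged.
    pending_updates: set = field(default_factory=set)


class ToySlave:
    """Hypothetical stand-in for the slave's executor bookkeeping."""

    def __init__(self):
        self.executors = {}

    def add_executor(self, executor_id, pending_updates=()):
        self.executors[executor_id] = Executor(
            executor_id, pending_updates=set(pending_updates)
        )

    def executor_exited(self, executor_id):
        # The executor terminated on its own; the entry is kept around
        # until every outstanding status update has been acknowledged.
        executor = self.executors[executor_id]
        executor.state = ExecutorState.TERMINATED
        self._maybe_remove(executor)

    def acknowledge(self, executor_id, update_id):
        executor = self.executors[executor_id]
        executor.pending_updates.discard(update_id)
        self._maybe_remove(executor)

    def launch_task(self, executor_id, task_id):
        executor = self.executors.get(executor_id)
        if executor is not None and executor.state is ExecutorState.TERMINATED:
            # The scheduler does not yet know the executor is gone.
            return "TASK_LOST"
        if executor is None:
            # Assumption: an unknown executor ID starts a fresh executor.
            self.add_executor(executor_id)
        return "TASK_STAGING"

    def _maybe_remove(self, executor):
        if (
            executor.state is ExecutorState.TERMINATED
            and not executor.pending_updates
        ):
            del self.executors[executor.executor_id]
```

In this model, a task launched after {{executor_exited("gc")}} but before the last {{acknowledge("gc", ...)}} comes back as TASK_LOST, while the same launch after the final acknowledgment starts a fresh executor. The longer acknowledgments are delayed, the wider that window becomes.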
This feels like semantically correct behavior, but when master load peaks, the
delay in processing status updates can be long, and the master can suddenly see
many TASK_LOSTs from different slaves simultaneously after it sends these
"stale" tasks to terminated executors (because of the delay).
I think we want to mitigate the sudden surge of lost tasks, which can be
alarming and which we can't differentiate from other, more serious situations
(*maybe by introducing another {{TaskState}}*), but it still sounds correct to
me that the slave should reject tasks that target a terminated executorID.
Maybe there should be better semantics that *require the scheduler to initiate
the executor's seppuku*, so that it won't send tasks that reach the slave only
after the executor's lifespan has ended?