[
https://issues.apache.org/jira/browse/MESOS-1388?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Benjamin Mahler updated MESOS-1388:
-----------------------------------
Description:
The following sequence of events could result in the master sending
TASK_LOST and then TASK_FINISHED for the same task to a framework.
--> Master failed over
--> Slave tries to re-register with Master w/ a running task (T)
--> Master starts re-admission into the registry
--> Task finishes and slave removes it from its map
--> The TASK_FINISHED status update is dropped by master as re-admission is in
progress
--> The executor terminates on the slave.
--> Slave retries re-registration (w/o task T) as master is still busy
re-admitting it and hasn't ACKed the re-registration yet
--> Master finally finishes re-admission and re-adds slave with task T
--> Master gets a duplicate/enqueued re-registration request (w/o task T) that
results in the master sending TASK_LOST during reconciliation.
--> Master now gets retried TASK_FINISHED update from the slave which it
forwards to the scheduler.
Normally, when the slave re-registers, it includes tasks with unacknowledged
terminal updates in the message to the master. However, once the executor has
terminated, the slave does not send any of that executor's tasks. This is
problematic when there are unacknowledged updates for tasks run by the executor.
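
The sequence above can be replayed with a small toy model. This is only an illustrative sketch, not Mesos code: the `Master` class, its method names, and the `replay` helper are all hypothetical, and the model assumes that a re-registration arriving during re-admission is queued and reconciled once re-admission completes.

```python
class Master:
    """Toy model of the master's bookkeeping (hypothetical, not the Mesos API)."""

    def __init__(self):
        self.known_tasks = {}   # slave id -> task ids the master believes are live
        self.readmitting = {}   # slave id -> task ids from the in-flight re-registration
        self.queued = []        # re-registrations that arrived during re-admission
        self.sent = []          # (task id, state) updates forwarded to the framework

    def reregister(self, slave, tasks):
        tasks = set(tasks)
        if slave in self.readmitting:
            # Re-admission still in progress: the retry is enqueued.
            self.queued.append((slave, tasks))
        elif slave in self.known_tasks:
            # Slave already re-added: reconcile against what it reports now.
            for task in sorted(self.known_tasks[slave] - tasks):
                self.sent.append((task, "TASK_LOST"))
            self.known_tasks[slave] = tasks
        else:
            self.readmitting[slave] = tasks

    def status_update(self, slave, task, state):
        if slave in self.readmitting:
            return False  # dropped while re-admission is in progress
        self.sent.append((task, state))
        self.known_tasks.get(slave, set()).discard(task)
        return True

    def finish_readmission(self, slave):
        # Re-add the slave with the tasks from the ORIGINAL re-registration.
        self.known_tasks[slave] = self.readmitting.pop(slave)
        queued, self.queued = self.queued, []
        for queued_slave, tasks in queued:
            self.reregister(queued_slave, tasks)


def replay():
    master = Master()
    master.reregister("S", {"T"})                      # re-register w/ running task T
    master.status_update("S", "T", "TASK_FINISHED")    # dropped by the master
    master.reregister("S", set())                      # retry w/o T (executor terminated)
    master.finish_readmission("S")                     # re-adds S with T, then reconciles
    master.status_update("S", "T", "TASK_FINISHED")    # retried update is forwarded
    return master.sent
```

Under these assumptions, `replay()` yields TASK_LOST followed by TASK_FINISHED for T, which is the inconsistency described in this issue.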
was:
The following is a sequence of events that could result in master sending
TASK_LOST and then TASK_FINISHED for a task to a framework.
--> Master failed over
--> Slave tries to re-register with Master w/ a running task (T)
--> Master starts re-admission into the registry
--> Task finishes and slave removes it from its map
--> The TASK_FINISHED status update is dropped by master as re-admission is in
progress
--> Slave retries re-registration (w/o task T) as master is still busy
re-admitting it and hasn't ACKed the re-registration yet
--> Master finally finishes re-admission and re-adds slave with task T
--> Master gets a duplicate/enqueued re-registration request (w/o task T) that
results in the master sending TASK_LOST during reconciliation.
--> Master now gets retried TASK_FINISHED update from the slave which it
forwards to the scheduler.
The crux of the issue is that the master doesn't know about tasks in terminal
states that belong to a re-registering slave. The right way to fix this issue
is to have slave re-registering with tasks that have pending terminal updates
and possibly having ACKs go through the master.
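
A rough sketch of the idea in the paragraph above, under stated assumptions: the helper names `tasks_for_reregistration` and `reconcile` are hypothetical and do not correspond to actual Mesos functions, and tasks are reduced to plain id/state pairs. It shows only that reporting tasks with pending terminal updates keeps reconciliation from generating TASK_LOST.

```python
# Terminal task states (as of the Mesos version this issue affects).
TERMINAL = {"TASK_FINISHED", "TASK_FAILED", "TASK_KILLED", "TASK_LOST"}


def tasks_for_reregistration(running, pending_updates):
    """Build the task list a slave would report when re-registering.

    running:          dict of task id -> state for tasks still tracked as live
    pending_updates:  dict of task id -> latest unacknowledged status update
    """
    tasks = dict(running)
    for task_id, state in pending_updates.items():
        if state in TERMINAL:
            # Keep reporting the task until its terminal update is acknowledged,
            # even if its executor has already terminated.
            tasks[task_id] = state
    return tasks


def reconcile(master_tasks, reported):
    """Return the TASK_LOST updates the master would generate for tasks it
    knows about but the slave did not report."""
    return [(t, "TASK_LOST") for t in sorted(master_tasks) if t not in reported]
```

For example, with no running tasks and a pending TASK_FINISHED for T, the slave would still report T, so `reconcile({"T"}, reported)` produces no TASK_LOST.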
> Inconsistent terminal task state between master and re-registering slave
> ------------------------------------------------------------------------
>
> Key: MESOS-1388
> URL: https://issues.apache.org/jira/browse/MESOS-1388
> Project: Mesos
> Issue Type: Bug
> Affects Versions: 0.19.0
> Reporter: Vinod Kone