[ 
https://issues.apache.org/jira/browse/MESOS-1388?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Mahler updated MESOS-1388:
-----------------------------------

    Description: 
The following is a sequence of events that could result in master sending 
TASK_LOST and then TASK_FINISHED for a task to a framework.

--> Master failed over
--> Slave tries to re-register with master w/ a running task (T)
--> Master starts re-admission into the registry
--> Task finishes and slave removes it from its map
--> The TASK_FINISHED status update is dropped by master as re-admission is in 
progress
--> The executor terminates on the slave.
--> Slave retries re-registration (w/o task T) as master is still busy 
re-admitting it and hasn't ACKed the re-registration yet
--> Master finally finishes re-admission and re-adds slave with task T
--> Master gets a duplicate/enqueued re-registration request (w/o task T) that 
results in the master sending TASK_LOST during reconciliation.
--> Master now gets retried TASK_FINISHED update from the slave which it 
forwards to the scheduler.
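
The sequence above can be replayed with a small toy model. This is only a sketch under stated assumptions: the Master class, its methods, and the simulate() helper are hypothetical names invented for illustration and are not Mesos code. It assumes the master drops status updates while re-admission is in progress, queues duplicate re-registrations, and reconciles them once re-admission completes.

```python
# Toy model of the race; every name here is hypothetical, not the Mesos API.
class Master:
    def __init__(self):
        self.readmitting = False   # True while the registry write is in flight
        self.pending = set()       # tasks reported by the first re-registration
        self.tasks = {}            # task id -> state as known by the master
        self.queued = []           # duplicate re-registrations received meanwhile
        self.sent = []             # status updates forwarded to the framework

    def reregister(self, task_ids):
        if self.readmitting:
            # Retry from the slave: handled only after re-admission finishes.
            self.queued.append(set(task_ids))
        else:
            self.readmitting = True
            self.pending = set(task_ids)

    def status_update(self, task_id, state):
        if self.readmitting:
            return False           # dropped; the slave will retry later
        self.sent.append((task_id, state))
        return True

    def finish_readmission(self):
        self.readmitting = False
        self.tasks = {t: "TASK_RUNNING" for t in self.pending}
        # Reconcile each queued retry against what the master believes.
        for reported in self.queued:
            for t in sorted(set(self.tasks) - reported):
                del self.tasks[t]
                self.sent.append((t, "TASK_LOST"))
        self.queued.clear()


def simulate():
    master = Master()
    master.reregister({"T"})                     # slave re-registers with T
    dropped = not master.status_update("T", "TASK_FINISHED")
    master.reregister(set())                     # retry w/o T, executor is gone
    master.finish_readmission()                  # re-adds slave with T, reconciles
    master.status_update("T", "TASK_FINISHED")   # retried update is forwarded
    return dropped, master.sent
```

Calling simulate() returns whether the first update was dropped, together with the list of updates the framework would see in this model, which makes the ordering of TASK_LOST and TASK_FINISHED easy to inspect.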

Normally, the slave re-registers and includes terminal unacknowledged tasks in 
the message to the master. However, when the executor is terminated, the slave 
does not send any of its tasks. This is problematic when there are 
unacknowledged updates for tasks run by the executor.
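
As a rough illustration of the paragraph above, the following sketch shows how a slave might choose which tasks to include in a re-registration message. The Task and Executor types, the tasks_to_report() function, and the include_terminated_executors flag are assumptions made up for this example and do not reflect the actual slave implementation; only the task state names come from Mesos.

```python
from dataclasses import dataclass, field

# Terminal task states (names as used by Mesos).
TERMINAL = {"TASK_FINISHED", "TASK_FAILED", "TASK_KILLED", "TASK_LOST"}


@dataclass
class Task:
    task_id: str
    state: str
    acknowledged: bool = False   # has the latest status update been ACKed?


@dataclass
class Executor:
    terminated: bool
    tasks: list = field(default_factory=list)


def tasks_to_report(executors, include_terminated_executors):
    """Return the task ids a (hypothetical) slave would send on re-registration."""
    reported = []
    for executor in executors:
        if executor.terminated and not include_terminated_executors:
            # Behavior described in the issue: none of its tasks are sent.
            continue
        for task in executor.tasks:
            terminal = task.state in TERMINAL
            # Report non-terminal tasks and terminal tasks whose update
            # has not been acknowledged yet.
            if not terminal or not task.acknowledged:
                reported.append(task.task_id)
    return reported
```

With a single terminated executor holding an unacknowledged TASK_FINISHED task, tasks_to_report(executors, False) yields an empty list, while tasks_to_report(executors, True) includes the task, so the master in this model would have no reason to declare it lost.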

  was:
The following is a sequence of events that could result in master sending 
TASK_LOST and then TASK_FINISHED for a task to a framework.

--> Master failed over
--> Slave tries to re-register with master w/ a running task (T)
--> Master starts re-admission into the registry
--> Task finishes and slave removes it from its map
--> The TASK_FINISHED status update is dropped by master as re-admission is in 
progress
--> Slave retries re-registration (w/o task T) as master is still busy 
re-admitting it and hasn't ACKed the re-registration yet
--> Master finally finishes re-admission and re-adds slave with task T
--> Master gets a duplicate/enqueued re-registration request (w/o task T) that 
results in the master sending TASK_LOST during reconciliation.
--> Master now gets retried TASK_FINISHED update from the slave which it 
forwards to the scheduler.


The crux of the issue is that the master doesn't know about tasks in terminal 
states that belong to a re-registering slave. The right way to fix this issue 
is to have the slave re-register with tasks that have pending terminal updates, 
and possibly to have ACKs go through the master.



> Inconsistent terminal task state between master and re-registering slave
> ------------------------------------------------------------------------
>
>                 Key: MESOS-1388
>                 URL: https://issues.apache.org/jira/browse/MESOS-1388
>             Project: Mesos
>          Issue Type: Bug
>    Affects Versions: 0.19.0
>            Reporter: Vinod Kone
>
> The following is a sequence of events that could result in master sending 
> TASK_LOST and then TASK_FINISHED for a task to a framework.
> --> Master failed over
> --> Slave tries to re-register with master w/ a running task (T)
> --> Master starts re-admission into the registry
> --> Task finishes and slave removes it from its map
> --> The TASK_FINISHED status update is dropped by master as re-admission is 
> in progress
> --> The executor terminates on the slave.
> --> Slave retries re-registration (w/o task T) as master is still busy 
> re-admitting it and hasn't ACKed the re-registration yet
> --> Master finally finishes re-admission and re-adds slave with task T
> --> Master gets a duplicate/enqueued re-registration request (w/o task T) 
> that results in the master sending TASK_LOST during reconciliation.
> --> Master now gets retried TASK_FINISHED update from the slave which it 
> forwards to the scheduler.
> Normally, the slave re-registers and includes terminal unacknowledged tasks 
> in the message to the master. However, when the executor is terminated, the 
> slave does not send any of its tasks. This is problematic when there are 
> unacknowledged updates for tasks run by the executor.



--
This message was sent by Atlassian JIRA
(v6.2#6252)
