[ 
https://issues.apache.org/jira/browse/MESOS-1388?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14011659#comment-14011659
 ] 

Benjamin Mahler commented on MESOS-1388:
----------------------------------------

https://reviews.apache.org/r/21991/

> Inconsistent terminal task state between master and re-registering slave
> ------------------------------------------------------------------------
>
>                 Key: MESOS-1388
>                 URL: https://issues.apache.org/jira/browse/MESOS-1388
>             Project: Mesos
>          Issue Type: Bug
>    Affects Versions: 0.19.0
>            Reporter: Vinod Kone
>            Assignee: Benjamin Mahler
>
> The following is a sequence of events that could result in master sending 
> TASK_LOST and then TASK_FINISHED for a task to a framework.
> --> Master failed over
> --> Slave tries to re-register with Master w/ a running task (T)
> --> Master starts re-admission into the registry
> --> Task finishes and slave removes it from its map
> --> The TASK_FINISHED status update is dropped by master as re-admission is 
> in progress
> --> The executor terminates on the slave.
> --> Slave retries re-registration (w/o task T) as master is still busy 
> re-admitting it and hasn't ACKed the re-registration yet
> --> Master finally finishes re-admission and re-adds slave with task T
> --> Master gets a duplicate/enqueued re-registration request (w/o task T) 
> that results in the master sending TASK_LOST during reconciliation.
> --> Master now gets the retried TASK_FINISHED update from the slave, which it 
> forwards to the scheduler.
> Normally, the slave re-registers and includes terminal unacknowledged tasks 
> in the message to the master. However, when the executor is terminated, the 
> slave does not send any of its tasks. This is problematic when there are 
> unacknowledged updates for tasks run by the executor.
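
The sequence quoted above can be replayed with a small toy model. This is
only a sketch: the Master class, its methods, and the
include_unacked_terminal_tasks flag are hypothetical names made up for
illustration and are not Mesos code. The flag models one possible remedy
suggested by the last paragraph of the description (the slave keeps
reporting task T while its terminal update is unacknowledged); it is not
necessarily what the linked review does.

```python
# Toy replay of the race in MESOS-1388. Every name here is hypothetical.

class Master:
    def __init__(self):
        self.readmitting = False  # True while registry re-admission is in flight
        self.pending = None       # (slave_id, tasks) waiting on the registry
        self.tasks = {}           # slave_id -> task ids the master believes exist
        self.sent = []            # (task_id, state) forwarded to the framework

    def begin_readmission(self, slave_id, tasks):
        self.readmitting = True
        self.pending = (slave_id, set(tasks))

    def finish_readmission(self):
        slave_id, tasks = self.pending
        self.tasks[slave_id] = tasks
        self.pending = None
        self.readmitting = False

    def status_update(self, slave_id, task_id, state):
        if self.readmitting:
            return False  # dropped; the slave is expected to retry later
        self.sent.append((task_id, state))
        self.tasks.get(slave_id, set()).discard(task_id)
        return True

    def reregister(self, slave_id, tasks):
        # Duplicate re-registration: tasks the master knows about but the
        # slave did not report are reconciled as lost.
        known = self.tasks.get(slave_id, set())
        for task_id in sorted(known - set(tasks)):
            self.sent.append((task_id, "TASK_LOST"))
        self.tasks[slave_id] = set(tasks)


def replay(include_unacked_terminal_tasks):
    master = Master()
    master.begin_readmission("S", ["T"])               # first re-registration
    master.status_update("S", "T", "TASK_FINISHED")    # dropped by the master
    # The executor has terminated; the retry omits T unless the slave still
    # reports tasks with unacknowledged terminal updates.
    retry_tasks = ["T"] if include_unacked_terminal_tasks else []
    master.finish_readmission()                        # slave re-added with T
    master.reregister("S", retry_tasks)                # enqueued duplicate
    master.status_update("S", "T", "TASK_FINISHED")    # retried update
    return master.sent


print(replay(False))  # T is reported lost, then finished
print(replay(True))   # T is only reported finished
```

With the flag off, the framework sees TASK_LOST followed by TASK_FINISHED
for T, which is the inconsistency reported here. With the flag on,
reconciliation finds nothing missing and only TASK_FINISHED is forwarded.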



--
This message was sent by Atlassian JIRA
(v6.2#6252)
