[ 
https://issues.apache.org/jira/browse/MESOS-1389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14005133#comment-14005133
 ] 

Benjamin Mahler commented on MESOS-1389:
----------------------------------------

Because the master does not currently handle status update acknowledgements,
routing all acknowledgements through the master will unfortunately require
releasing the change in two phases. This means 0.19.0 will still ship with
this reconciliation bug:

*0.19.0*: Implement status update acknowledgement handling in the Master. 
Ensure the Slave can handle acknowledgements coming from both the Master and 
the Scheduler Driver. It is *key* that the slave ignores any acknowledgements 
coming from the non-leading Master (see below).

*0.20.0*: Update the scheduler driver to always send status update
acknowledgements through the Master.

To ensure that this is done correctly, the Slave must reject acknowledgements
from a non-leading master. This is required to prevent the following:

1. The slave re-registers with the leading master with a terminal task T,
since the terminal status update has not been acknowledged.
2. The slave receives a stale acknowledgement message from the *old* master
and stops retrying the update.
3. The leading master is stuck with the task T in its memory, since the slave 
will never retry the update.
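
The slave-side check can be sketched as follows. This is a minimal
illustration in Python, not the actual Mesos C++ code: the Slave class, its
fields, and handle_acknowledgement are hypothetical names chosen for the
example, and accepting acknowledgements sent directly by the scheduler driver
is an assumption about the transitional 0.19.0 behavior.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class Slave:
    """Toy model of a slave's unacknowledged updates (hypothetical, not Mesos code)."""
    leading_master: str  # address of the master the slave is registered with
    pending: Dict[str, str] = field(default_factory=dict)  # task id -> update uuid

    def handle_acknowledgement(self, task_id: str, uuid: str,
                               from_master: Optional[str] = None) -> bool:
        """Return True if the acknowledgement was accepted.

        from_master is None when the scheduler driver sent the
        acknowledgement directly (assumed to remain valid in phase one).
        """
        # Drop acknowledgements relayed by any master other than the leader.
        if from_master is not None and from_master != self.leading_master:
            return False
        # Drop acknowledgements that do not match the update being retried.
        if self.pending.get(task_id) != uuid:
            return False
        # Accepted: the slave can stop retrying this update.
        del self.pending[task_id]
        return True


slave = Slave(leading_master="master-2", pending={"T": "uuid-1"})
stale = slave.handle_acknowledgement("T", "uuid-1", from_master="master-1")
still_pending = "T" in slave.pending
fresh = slave.handle_acknowledgement("T", "uuid-1", from_master="master-2")
```

In this sketch the stale acknowledgement from "master-1" is dropped, so the
slave keeps retrying the update for T until the leading master acknowledges
it.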

> Reconciliation can send TASK_LOST before a terminal update reaches the 
> framework.
> ---------------------------------------------------------------------------------
>
>                 Key: MESOS-1389
>                 URL: https://issues.apache.org/jira/browse/MESOS-1389
>             Project: Mesos
>          Issue Type: Bug
>    Affects Versions: 0.19.0
>            Reporter: Benjamin Mahler
>             Fix For: 0.19.0
>
>
> There's an unfortunate case with reconciliation, where we end up sending 
> TASK_LOST first and then the slave sends the valid terminal status update.
> This happens when the slave re-registers with terminal tasks that have 
> un-acked updates. The master does not store these tasks, so while the slave 
> still needs to send the terminal status updates, the master will reply with 
> TASK_LOST for reconciliation.
> We may need to ensure that all status update acknowledgements go through the 
> master to fix this.
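
The race described in the issue can be modeled roughly as below. This is a
simplified sketch, not the master's real logic: the Master class and its
method names are hypothetical, and the only behavior taken from the
description is that terminal tasks reported at re-registration are not
stored and that unknown tasks are reconciled as TASK_LOST.

```python
TERMINAL_STATES = {"TASK_FINISHED", "TASK_FAILED", "TASK_KILLED", "TASK_LOST"}


class Master:
    """Toy model of the master's task bookkeeping (hypothetical, not Mesos code)."""

    def __init__(self):
        self.tasks = {}  # task id -> last known state

    def reregister_slave(self, reported_tasks):
        # Terminal tasks are not stored, even if their updates are un-acked.
        for task_id, state in reported_tasks.items():
            if state not in TERMINAL_STATES:
                self.tasks[task_id] = state

    def reconcile(self, task_id):
        # Tasks the master does not know about are reported as lost.
        return self.tasks.get(task_id, "TASK_LOST")


master = Master()
# The slave re-registers with task T, whose TASK_FINISHED update is un-acked.
master.reregister_slave({"T": "TASK_FINISHED"})

# The framework reconciles before the slave's retried update arrives.
updates_seen_by_framework = [master.reconcile("T"), "TASK_FINISHED"]
```

The framework in this model sees TASK_LOST first and the valid terminal
update second, which is the ordering problem the issue reports.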



--
This message was sent by Atlassian JIRA
(v6.2#6252)
