[ https://issues.apache.org/jira/browse/MESOS-1389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14005133#comment-14005133 ]
Benjamin Mahler commented on MESOS-1389:
----------------------------------------
Because the master does not currently handle status update acknowledgements,
routing all acknowledgements through the master will unfortunately require
releasing the change in 2 phases. This means 0.19.0 will be stuck with this
reconciliation bug:
*0.19.0*: Implement status update acknowledgement handling in the Master.
Ensure the Slave can handle acknowledgements coming from both the Master and
the Scheduler Driver. It is *key* that the slave ignores any acknowledgements
coming from the non-leading Master (see below).
*0.20.0*: Update the scheduler driver to always send status update
acknowledgements to the Master.
To ensure that this is done correctly, the Slave must reject acknowledgments
from the non-leading master. This is required to prevent the following:
1. The slave re-registers with the master and reports a terminal task T, since
T's terminal status update has not yet been acknowledged.
2. The slave receives a stale acknowledgement message from the *old* master
and stops retrying the update.
3. The leading master is stuck with the task T in its memory, since the slave
will never retry the update.
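To make the required check concrete, here is a minimal sketch of the slave-side
rule for the 0.19.0 phase. It is only an illustration, not the actual Mesos
code: the `Slave` class, the `acknowledge` method, the `from_master` parameter
and the master PID strings are all hypothetical names. The assumption is that
an acknowledgement forwarded by a master carries that master's identity, while
one sent directly by a scheduler driver does not.

```python
from dataclasses import dataclass, field
from typing import Optional, Set, Tuple


@dataclass
class Slave:
    """Hypothetical model of a slave tracking unacknowledged status updates."""

    # Identity of the master this slave currently considers the leader.
    leading_master: Optional[str] = None
    # (task_id, update_uuid) pairs the slave is still retrying.
    pending: Set[Tuple[str, str]] = field(default_factory=set)

    def acknowledge(self, task_id: str, uuid: str,
                    from_master: Optional[str] = None) -> bool:
        """Handle an acknowledgement; return True if it was accepted.

        from_master is None when the acknowledgement came directly from a
        scheduler driver (the pre-0.20.0 path, which must keep working).
        """
        if from_master is not None and from_master != self.leading_master:
            # Stale acknowledgement from a non-leading master: ignore it and
            # keep retrying, so the leading master eventually sees the
            # terminal update and can remove task T.
            return False
        # Accepted: stop retrying this update.
        self.pending.discard((task_id, uuid))
        return True


slave = Slave(leading_master="master@new")
slave.pending.add(("T", "uuid-1"))

# Step 2 of the scenario: the stale acknowledgement is rejected.
stale_accepted = slave.acknowledge("T", "uuid-1", from_master="master@old")

# The acknowledgement from the leading master is accepted.
fresh_accepted = slave.acknowledge("T", "uuid-1", from_master="master@new")
```

With this check in place, the update for T stays in the retry set after the
stale acknowledgement and is only dropped once the leading master (or, during
0.19.0, the scheduler driver) acknowledges it.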
> Reconciliation can send TASK_LOST before a terminal update reaches the
> framework.
> ---------------------------------------------------------------------------------
>
> Key: MESOS-1389
> URL: https://issues.apache.org/jira/browse/MESOS-1389
> Project: Mesos
> Issue Type: Bug
> Affects Versions: 0.19.0
> Reporter: Benjamin Mahler
> Fix For: 0.19.0
>
>
> There's an unfortunate case with reconciliation, where we end up sending
> TASK_LOST first and then the slave sends the valid terminal status update.
> This happens when the slave re-registers with terminal tasks that have
> un-acked updates: the master does not store these tasks, so while the slave
> still needs to send the terminal status updates, the master will reply with
> TASK_LOST for reconciliation.
> We may need to ensure that all status update acknowledgements go through the
> master to fix this.