Re: Review Request 15745: Fixed some task reconciliation cases.

Ben Mahler Thu, 21 Nov 2013 14:04:41 -0800

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/15745/#review29249
-----------------------------------------------------------

Slaves continue to retry status updates until an acknowledgement is received 
(the scheduler driver sends an acknowledgement once Scheduler::statusUpdate has 
been invoked). So in Case 1 as you've described it, if a task transitions while 
a framework is failing over, the slave will continue to retry sending the 
status update. When the framework reconnects to the master, the retried status 
update will be delivered, so reconciling this case would be unnecessary and 
possibly error prone:

Normally:

Framework fails over with T1 in STARTING.
T1 transitions from STARTING->RUNNING->FINISHED in the slave and master.
Framework reconnects thinking T1 is STARTING.
First the slave retries RUNNING and this now gets processed by the Framework. 
Acknowledgement is sent to the slave.
Then the slave retries FINISHED and this now gets processed by the Framework. 
Acknowledgement is sent to the slave.

Framework sees the following order: STARTING -> RUNNING -> FINISHED
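
As a rough illustration of the normal flow above, here is a minimal Python
sketch. It is not Mesos code: the Slave class, its methods, and
deliver_after_failover are hypothetical names, and it assumes the slave
resends only the oldest unacknowledged update for a task until that update is
acknowledged.

```python
from collections import deque


class Slave:
    """Hypothetical model of a slave's unacknowledged updates for one task."""

    def __init__(self, updates):
        # Oldest unacknowledged update sits at the front of the queue.
        self.pending = deque(updates)

    def retry(self):
        # Assumption: only the oldest unacknowledged update is resent.
        return self.pending[0] if self.pending else None

    def acknowledge(self, state):
        # An acknowledgement removes the matching update from the queue.
        if self.pending and self.pending[0] == state:
            self.pending.popleft()


def deliver_after_failover(slave, framework_view):
    """Deliver retried updates to a reconnected framework, acking each one."""
    update = slave.retry()
    while update is not None:
        framework_view.append(update)  # stands in for Scheduler::statusUpdate
        slave.acknowledge(update)      # stands in for the driver's ack
        update = slave.retry()
    return framework_view


# T1 went STARTING -> RUNNING -> FINISHED while the framework was away.
normal_order = deliver_after_failover(
    Slave(["RUNNING", "FINISHED"]), ["STARTING"])
print(" -> ".join(normal_order))  # STARTING -> RUNNING -> FINISHED
```

Because each update is held until it is acknowledged, the framework catches up
in the same order the task actually transitioned.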

Assuming reconciliation Case 1 here:
Framework fails over with T1 in STARTING.
T1 transitions from STARTING->RUNNING->FINISHED in the slave and master.
Framework reconnects thinking T1 is STARTING.
Framework reconciles and master sends FINISHED.
The slave retries RUNNING and this now gets processed by the Framework. 
Acknowledgement is sent to the slave.
Then the slave retries FINISHED and this now gets processed by the Framework. 
Acknowledgement is sent to the slave.

Framework sees the following order: STARTING -> FINISHED -> RUNNING -> FINISHED

Case 1 here can unfortunately lead to out-of-order update delivery to 
frameworks.
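
To make the ordering problem concrete, here is a small Python sketch that
replays both scenarios. It is only a model under stated assumptions:
observed_order, is_in_order, and the RANK table are hypothetical names, and it
assumes the master's reconciliation reply reaches the framework before the
slave's retried updates.

```python
# Hypothetical ranking of the task states used in this thread.
RANK = {"STARTING": 0, "RUNNING": 1, "FINISHED": 2}


def observed_order(reconcile):
    """Return the sequence of states the framework sees for T1."""
    master_state = "FINISHED"          # the master's latest known state
    unacked = ["RUNNING", "FINISHED"]  # updates the slave is still retrying
    seen = ["STARTING"]                # the framework's view before failover

    if reconcile:
        # Assumption: the reconciliation reply arrives before any retry.
        seen.append(master_state)

    # The slave then retries each unacknowledged update, oldest first.
    seen.extend(unacked)
    return seen


def is_in_order(seen):
    """True if the states never move backwards."""
    ranks = [RANK[s] for s in seen]
    return all(a <= b for a, b in zip(ranks, ranks[1:]))


without_reconciliation = observed_order(reconcile=False)
with_reconciliation = observed_order(reconcile=True)

print(" -> ".join(without_reconciliation), is_in_order(without_reconciliation))
print(" -> ".join(with_reconciliation), is_in_order(with_reconciliation))
```

In this model the reconciled run shows FINISHED followed by RUNNING, which is
the out-of-order delivery described above.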

- Ben Mahler

On Nov. 21, 2013, 12:30 a.m., Brenden Matthews wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/15745/
> -----------------------------------------------------------
> 
> (Updated Nov. 21, 2013, 12:30 a.m.)
> 
> 
> Review request for mesos and Niklas Nielsen.
> 
> 
> Repository: mesos-git
> 
> 
> Description
> -------
> 
> Fixed some task reconciliation cases.
> 
> Case 1:
> 
> If a slave is known but the task cannot be found, we should assume that
> the task has been lost.  It's possible that the following events
> occurred:
> 
>  1) Framework disconnected from master
>  2) Master terminated framework's tasks
>  3) Framework reconnects to master, and (incorrectly) assumes tasks are
>  still running
> 
> Case 2:
> 
> If a framework loses track of running tasks, the master should inform
> the framework of which tasks it knows to be running, in addition to any
> which have had a state change.
> 
> Review: https://reviews.apache.org/r/15745
> 
> 
> Diffs
> -----
> 
>   src/master/master.cpp a08d01208ff7bbb878b2d50d8406efee4de86171 
> 
> Diff: https://reviews.apache.org/r/15745/diff/
> 
> 
> Testing
> -------
> 
> `make check` & tested in staging cluster.
> 
> 
> Thanks,
> 
> Brenden Matthews
> 
>
