> On Oct. 15, 2014, 4:03 a.m., Adam B wrote:
> > src/slave/slave.cpp, lines 938-948
> > <https://reviews.apache.org/r/26699/diff/1/?file=720970#file720970line938>
> >
> >     Couldn't the Slave and the SUM get out of sync here? Right now, the SUM 
> > will flush its pending status updates as soon as a new master is detected.
> >     I'm imagining a scenario where the SUM is flushing status updates and 
> > the slave handles a status ACK interleaved with a slave re-registration 
> > delivering stale or out-of-sync task states.
> >     Wouldn't it just be better if the SUM didn't flush until after the 
> > slave has successfully re-registered?
> 
> Vinod Kone wrote:
>     Definitely thought about this race. 
>     
>     Yes, it would be better if SUM did the flush after re-registration but I 
> think it is still a race because re-registration could happen due to ZK blips 
> where updates and acks are in flight.
>     
>     I added a comment on why it is safe. Let me know if you still have 
> concerns.
> 
> Vinod Kone wrote:
>     Actually, after thinking more about this and discussing with BenM, there 
> is fundamentally still a race between slave sending re-registered message and 
> SUM sending its update. To fix this, I will make couple changes.
>     1) Instead of SUM directly sending updates to the master, it will send to 
> the slave which will forward it to the master.
>     2) Slave updates the latest state of the task when it gets an update from 
> the executor and unacknowledged state when it gets an update from the SUM.
>     3) The latest state in StatusUpdate will be set by the slave (instead of 
> SUM) before forwarding it to the master.
>     
>     These changes should guarantee that re-registration and update messages 
> are always in sync w.r.t to the latest and unacknowledged state of a task. As 
> a side benefit of 1), updates won't be sent to the master when slave is in 
> the middle of re-registration.
>     
>     Thoughts?
> 
> Adam B wrote:
>     1) I like the design idea: clear API boundaries. Funnel all communication 
> through the slave, and internal actor processes like SUM won't actually 
> communicate outside the slave node.
>     2,3) I wonder if maybe the slave should update the latest state of the 
> task when it acks the update from the executor, rather than on first receipt, 
> in case we need to ensure that state has been checkpointed? Really just a 
> matter of where in statusUpdate (or its continuations) you insert the line to 
> update the task's latest state.
>     I'll be able to reason more clearly about it after seeing it in code.

For 2,3) I kept the old semantics for now i.e., state changes happen in 
statusUpdate() instead of _statusUpdate(), since I didn't want to change the 
containerizer update semantics.


- Vinod


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/26699/#review56604
-----------------------------------------------------------


On Oct. 17, 2014, 12:26 a.m., Vinod Kone wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/26699/
> -----------------------------------------------------------
> 
> (Updated Oct. 17, 2014, 12:26 a.m.)
> 
> 
> Review request for mesos, Adam B, Ben Mahler, and Niklas Nielsen.
> 
> 
> Bugs: MESOS-1799 and MESOS-1817
>     https://issues.apache.org/jira/browse/MESOS-1799
>     https://issues.apache.org/jira/browse/MESOS-1817
> 
> 
> Repository: mesos-git
> 
> 
> Description
> -------
> 
> Slave re-registration now sends both the latest state and unacknowledged 
> state to the master.
> 
> 
> Diffs
> -----
> 
>   src/messages/messages.proto 8de7f9699f43aa2780f4a39fed087abcf5e5af99 
>   src/slave/slave.cpp 0e342ed35e3db3b68f9f32b6cf4ace23e4a4db38 
>   src/tests/fault_tolerance_tests.cpp 
> a75910d4f486230ba3f1d8927e5f1e5fda6e287b 
>   src/tests/slave_tests.cpp f585bdd20ae1af466f2c1b4d85331ac67451552f 
> 
> Diff: https://reviews.apache.org/r/26699/diff/
> 
> 
> Testing
> -------
> 
> make check
> 
> Ran new test 1000 times.
> 
> 
> Thanks,
> 
> Vinod Kone
> 
>

Reply via email to