> On Oct. 15, 2014, 4:03 a.m., Adam B wrote: > > src/slave/slave.cpp, lines 938-948 > > <https://reviews.apache.org/r/26699/diff/1/?file=720970#file720970line938> > > > > Couldn't the Slave and the SUM get out of sync here? Right now, the SUM > > will flush its pending status updates as soon as a new master is detected. > > I'm imagining a scenario where the SUM is flushing status updates and > > the slave handles a status ACK interleaved with a slave re-registration > > delivering stale or out-of-sync task states. > > Wouldn't it just be better if the SUM didn't flush until after the > > slave has successfully re-registered? > > Vinod Kone wrote: > Definitely thought about this race. > > Yes, it would be better if SUM did the flush after re-registration but I > think it is still a race because re-registration could happen due to ZK blips > where updates and acks are in flight. > > I added a comment on why it is safe. Let me know if you still have > concerns. > > Vinod Kone wrote: > Actually, after thinking more about this and discussing with BenM, there > is fundamentally still a race between slave sending re-registered message and > SUM sending its update. To fix this, I will make couple changes. > 1) Instead of SUM directly sending updates to the master, it will send to > the slave which will forward it to the master. > 2) Slave updates the latest state of the task when it gets an update from > the executor and unacknowledged state when it gets an update from the SUM. > 3) The latest state in StatusUpdate will be set by the slave (instead of > SUM) before forwarding it to the master. > > These changes should guarantee that re-registration and update messages > are always in sync w.r.t to the latest and unacknowledged state of a task. As > a side benefit of 1), updates won't be sent to the master when slave is in > the middle of re-registration. > > Thoughts? > > Adam B wrote: > 1) I like the design idea: clear API boundaries. Funnel all communication > through the slave, and internal actor processes like SUM won't actually > communicate outside the slave node. > 2,3) I wonder if maybe the slave should update the latest state of the > task when it acks the update from the executor, rather than on first receipt, > in case we need to ensure that state has been checkpointed? Really just a > matter of where in statusUpdate (or its continuations) you insert the line to > update the task's latest state. > I'll be able to reason more clearly about it after seeing it in code.
For 2,3) I kept the old semantics for now i.e., state changes happen in statusUpdate() instead of _statusUpdate(), since I didn't want to change the containerizer update semantics. - Vinod ----------------------------------------------------------- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/26699/#review56604 ----------------------------------------------------------- On Oct. 17, 2014, 12:26 a.m., Vinod Kone wrote: > > ----------------------------------------------------------- > This is an automatically generated e-mail. To reply, visit: > https://reviews.apache.org/r/26699/ > ----------------------------------------------------------- > > (Updated Oct. 17, 2014, 12:26 a.m.) > > > Review request for mesos, Adam B, Ben Mahler, and Niklas Nielsen. > > > Bugs: MESOS-1799 and MESOS-1817 > https://issues.apache.org/jira/browse/MESOS-1799 > https://issues.apache.org/jira/browse/MESOS-1817 > > > Repository: mesos-git > > > Description > ------- > > Slave re-registration now sends both the latest state and unacknowledged > state to the master. > > > Diffs > ----- > > src/messages/messages.proto 8de7f9699f43aa2780f4a39fed087abcf5e5af99 > src/slave/slave.cpp 0e342ed35e3db3b68f9f32b6cf4ace23e4a4db38 > src/tests/fault_tolerance_tests.cpp > a75910d4f486230ba3f1d8927e5f1e5fda6e287b > src/tests/slave_tests.cpp f585bdd20ae1af466f2c1b4d85331ac67451552f > > Diff: https://reviews.apache.org/r/26699/diff/ > > > Testing > ------- > > make check > > Ran new test 1000 times. > > > Thanks, > > Vinod Kone > >
