> On Feb. 14, 2019, 12:31 p.m., Greg Mann wrote:
> > src/master/master.cpp
> > Lines 8686-8708 (original), 8686-8712 (patched)
> > <https://reviews.apache.org/r/69980/diff/1/?file=2125184#file2125184line8686>
> >
> > Consider the case of a terminal-but-unacknowledged operation which has
> > been sent to the master by a reregistered agent and which has its ID set.
> > Since we only place non-terminal operations in `orphanedOperations`, we
> > will get `frameworkWillAcknowledge == true` here. If this framework never
> > reregisters, then I think we could end up in a state where the agent
> > retries terminal updates for that operation forever.
> >
> > For such updates, I think the master needs to either:
> > 1) have a way to determine that this is a terminal-but-unacknowledged
> > orphaned operation (i.e. place it in `orphanedOperations`), or
> > 2) fall back to default behavior of acknowledging updates for
> > operations that it doesn't recognize.
> >
> > WDYT?

This is a good point. Orphan operations must include both terminal and
non-terminal operations. With the chain as it is right now, the only way
to produce a terminal orphan operation is to add a non-terminal orphan and
then receive a terminal update for it. That restriction is a bit weird,
since an UpdateSlaveMessage can also contain a terminal operation that
belongs to an unknown framework. As long as the `Master::updateSlave()`
method marks such an operation as an orphan, the master will be able to
adopt it. (This is choice 1.)


- Joseph


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/69980/#review212838
-----------------------------------------------------------


On Feb. 13, 2019, 3:23 p.m., Joseph Wu wrote:
>
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/69980/
> -----------------------------------------------------------
>
> (Updated Feb. 13, 2019, 3:23 p.m.)
>
>
> Review request for mesos, Benno Evers, Gastón Kleiman, and Greg Mann.
>
>
> Bugs: MESOS-9542
> https://issues.apache.org/jira/browse/MESOS-9542
>
>
> Repository: mesos
>
>
> Description
> -------
>
> When dealing with orphaned operation status updates, there are two
> cases the master must deal with:
> - The simple case is when the master knows the framework is completed.
> These status updates can be acknowledged by the master.
> - However, a completed framework can be rotated out of the master's
> memory. In addition, after master failover, if an agent reregisters
> before the framework, an operation can appear to be orphaned until
> the framework reregisters.
>
> This adds a fixed delay between agent reregistration and when the
> master acknowledges operation status updates from unknown frameworks.
> The delay should give frameworks ample time to reregister.
>
> The delay is based on agent reregistration in order to mitigate the
> delay of acknowledging status updates of frameworks rotated out of
> the completed frameworks buffer.
>
>
> Diffs
> -----
>
> src/master/constants.hpp b0ab9187b8c672180e2ffb8b63cb7349dbe43ac4
> src/master/master.cpp 014e0e053cdf5c53a5ef8d63300205a121bed319
>
>
> Diff: https://reviews.apache.org/r/69980/diff/1/
>
>
> Testing
> -------
>
> TODO: This case needs unit tests.
>
>
> Thanks,
>
> Joseph Wu
>
>
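
For reference, here is a rough, standalone sketch of the acknowledgement
decision discussed in this thread. It is not the code from the patch. The
names `AgentView`, `AckDecision`, `markOrphan`, `decide`, and
`ORPHAN_ACK_DELAY` (and its 10 minute value) are all hypothetical, and the
real logic in `src/master/master.cpp` may be structured quite differently.
The sketch assumes choice 1, i.e. operations are tracked as orphans
regardless of whether they are terminal, plus the fixed delay measured
from agent reregistration.

```cpp
#include <cassert>
#include <chrono>
#include <set>
#include <string>

using Clock = std::chrono::steady_clock;

// Hypothetical stand-in for the master's bookkeeping about one agent.
struct AgentView {
  Clock::time_point reregisteredAt{};
  // Operation UUIDs whose framework is unknown to the master.
  // Per choice 1, terminal and non-terminal operations both go here.
  std::set<std::string> orphanedOperations;
};

enum class AckDecision { FrameworkWillAck, MasterAcksNow, WaitForFramework };

// Hypothetical value; the patch puts its constant in constants.hpp.
constexpr std::chrono::minutes ORPHAN_ACK_DELAY{10};

// Marks an operation as orphaned, whether or not it is terminal.
void markOrphan(AgentView& agent, const std::string& operationUuid) {
  agent.orphanedOperations.insert(operationUuid);
}

// Decides who should acknowledge a status update for `operationUuid`.
AckDecision decide(
    const AgentView& agent,
    const std::string& operationUuid,
    bool frameworkRegistered,
    bool frameworkCompleted,
    Clock::time_point now) {
  if (frameworkRegistered) {
    return AckDecision::FrameworkWillAck;
  }
  if (frameworkCompleted) {
    // Simple case: the framework is known to be gone.
    return AckDecision::MasterAcksNow;
  }
  if (agent.orphanedOperations.count(operationUuid) == 0) {
    // Default behavior for operations the master does not recognize.
    return AckDecision::MasterAcksNow;
  }
  // Orphan of an unknown framework: give the framework time to reregister.
  if (now - agent.reregisteredAt >= ORPHAN_ACK_DELAY) {
    return AckDecision::MasterAcksNow;
  }
  return AckDecision::WaitForFramework;
}

// Small usage example: a terminal orphan reported right after reregistration.
void example() {
  AgentView agent;
  markOrphan(agent, "op-1");
  const auto early = agent.reregisteredAt + std::chrono::minutes(1);
  const auto late = agent.reregisteredAt + ORPHAN_ACK_DELAY;
  assert(decide(agent, "op-1", false, false, early) ==
         AckDecision::WaitForFramework);
  assert(decide(agent, "op-1", false, false, late) ==
         AckDecision::MasterAcksNow);
}
```

With this shape, the agent in Greg's scenario stops retrying once the
delay has elapsed, even if the framework never reregisters.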
