> On May 31, 2017, 12:09 a.m., Benjamin Mahler wrote: > > src/master/master.cpp > > Lines 6018-6025 (original), 6018-6025 (patched) > > <https://reviews.apache.org/r/59320/diff/2/?file=1721802#file1721802line6018> > > > > Hm.. what does slaveWasRemoved mean? Is it equivalen to the agent being > > marked as unreachable? Or is being removed something beyond being marked as > > unreachable (what it sounds like)? I had a hard time grasping the code > > below when it was referring to "unreachable" but the variable is called > > "removed".
`slaveWasRemoved` means that _this_ instance of the master was the one that removed the agent (because `removed` is only a cache, there might be false negatives). Similarly, `removed` contains agents that _this_ instance of the master has marked unreachable. No objection to renaming both variables, as long as we keep them in sync, although I can't think of a ton of better names. > On May 31, 2017, 12:09 a.m., Benjamin Mahler wrote: > > src/master/master.cpp > > Lines 6027-6043 (original), 6027-6043 (patched) > > <https://reviews.apache.org/r/59320/diff/2/?file=1721802#file1721802line6027> > > > > Hm.. it seems like we could maybe make this code easier to read. > > Currently, the loop for dealing with frameworks is down below and the loop > > for dealing with tasks is up here. However, it seems to me that these are > > highly related, and it would be easier to reason about if it were > > consolidated into a loop over the framework and then a loop over the tasks > > belonging to that framework. That would also eliminate the need for > > `partitionAwareFrameworks`. > > > > E.g. > > > > ``` > > foreach (const FrameworkInfo& framework, frameworks): > > if completed: > > shut down the framework, drop the tasks > > continue > > > > if agent was unreachable/removed and framework is not partition > > aware: > > shut down the framework, drop the tasks > > continue > > > > for each task in the framework: > > recoveredTasks.push_back(task) > > remove from unreachable tasks > > ``` Hmmm. I can see how this would makes things clearer, but it might result in some subtle behavioral differences. Right now, the master notifies the agent that it has re-registered, then sends it `ShutdownFrameworkMessage`s in some cases. Adopting the rewrite you suggested would mean shutting down frameworks while the agent thinks it hasn't re-registered yet. Not wrong but confusing. Since this code is likely to change (see the "RFC: Partition Awareness" thread on `dev`), I'm inclined to leave this as-is in the short term. - Neil ----------------------------------------------------------- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/59320/#review176399 ----------------------------------------------------------- On May 31, 2017, 8:35 p.m., Neil Conway wrote: > > ----------------------------------------------------------- > This is an automatically generated e-mail. To reply, visit: > https://reviews.apache.org/r/59320/ > ----------------------------------------------------------- > > (Updated May 31, 2017, 8:35 p.m.) > > > Review request for mesos and Benjamin Mahler. > > > Repository: mesos > > > Description > ------- > > Add an explanatory comment and a test case for a particular case in the > master's logic for handling agent re-registration. > > > Diffs > ----- > > src/master/master.cpp a5376f9fc3c3c6c72b22cfd36ec56a1de0d03a3a > src/tests/partition_tests.cpp 62a84f797201ccd18b71490949e3130d2b9c3668 > > > Diff: https://reviews.apache.org/r/59320/diff/3/ > > > Testing > ------- > > `make check` > > Checked that if you add `CHECK_NOTNULL(framework)` to the master code (circa > the newly added comment), the incorrect `CHECK` is triggered by this test > case (but not by any existing test cases). > > > Thanks, > > Neil Conway > >
