> On Jan. 8, 2018, 5:31 p.m., Benno Evers wrote: > > The structure of the new `markUnreachable()` seems to be `[recover slave > > info depending on bool parameter] -> common code path to apply registry > > operation -> [notification/cleanup depending on bool parameter]`. > > Intuitively I'd try to get rid of the bool parameter and only put the > > shared part into `markUnreachable()`, but on the other hand the code might > > get even more complicated when you attempt to do this, so it should be fine > > to leave the general structure as is. > > Benjamin Mahler wrote: > Thanks for the review! > > Yes, it's a little unfortunate, I tried to unify as much as possible > (you'll notice that it now consistently checks against agent states, for > example). I initially considered getting rid of the boolean by inferring the > state from whether the agent is in `recovered` or `registered`, however since > it's asynchronously called it seems like there may be some risky cases (e.g. > agent moved from `recovered` to `registered` in the interim, so it should be > canceled, but we infer we're marking a registered agent as unreachable!). > > My thinking was that these can potentially go away longer term if we > treat recovered agents more consistently vs steady state agents by storing a > `Slave*` for them (similar to how we now store a `Framework*` for frameworks > recovered from agent information). > > Let me know if you have any thoughts!
Sounds good to me. > On Jan. 8, 2018, 5:31 p.m., Benno Evers wrote: > > src/master/master.hpp > > Lines 512 (patched) > > <https://reviews.apache.org/r/64930/diff/1/?file=1930013#file1930013line512> > > > > Whats actually the reasoning behind crashing the master instead of > > returning a failed future? > > Benjamin Mahler wrote: > Currently any registry failure is fatal: if we cannot persist any > information, we can't make progress on agent lifecycle events, configuration > updates, etc. Rather than trying to ensure that each caller handles the > future failure in a consistent way, I opted to just consistently abort within > this function. > > Thoughts? It seems to me like now you have the opposite problem, i.e. ensuring that each caller actually wants to consistently abort. But I was mostly just curious. > On Jan. 8, 2018, 5:31 p.m., Benno Evers wrote: > > src/master/master.cpp > > Line 8277 (original), 8218 (patched) > > <https://reviews.apache.org/r/64930/diff/1/?file=1930014#file1930014line8307> > > > > Do we need an `onDiscarded()` handler on an `undiscardable()` future? > > Benjamin Mahler wrote: > The undiscardable wrapper just prevents the discard request from > propagating inward / breaking any .then chains. It's still possible for there > to be a discarded future returned, so I felt it was safer to handle it. Ok, then. The naming of `undiscardable()` seems a bit unfortunate in this case :D > On Jan. 8, 2018, 5:31 p.m., Benno Evers wrote: > > src/master/master.cpp > > Line 8304 (original), 8235 (patched) > > <https://reviews.apache.org/r/64930/diff/1/?file=1930014#file1930014line8336> > > > > After a similar a change I got the following comment from @dzhuk, which > > I'll relay here because I think it's justified: > > > > > I think `CHECK` should be used to validate invariants that we have > > from some other context. but error here is something that can happen during > > normal operation. and the original version would produce a meaningful > > message in log > > Benjamin Mahler wrote: > Hm.. I think the current invariant is actually that these should always > succeed given the way we apply registry operations in the master, is that not > the case? Do you have a concrete suggestion for a follow up? Are you > suggesting logging and continuing? Probably warrants a more subtantial > conversation outside of this review. I agree that we should crash here, I was just referring to reporting the error via `LOG(FATAL)` with a human-readable error message vs. `CHECK()`-ing here. If I understand the registrar code correctly, the future could be failed for external reasons (e.g. disk full), which probably shouldn't translate to assertion failures in the logs. > On Jan. 8, 2018, 5:31 p.m., Benno Evers wrote: > > src/master/master.cpp > > Lines 8251 (patched) > > <https://reviews.apache.org/r/64930/diff/1/?file=1930014#file1930014line8352> > > > > Not sure what our coding style says about this, but we should think > > about preferring `.find()` over `.contains()` to avoid the double lookup. > > (which gcc cannot optimize away even at `-O3`, according to some quick > > local testing) > > Benjamin Mahler wrote: > We tend to avoid `.find` since it results in less readable code (` == > _.end()` and iterator indirection), but we have made the optimization in some > performance critical code paths. This doesn't seem to be one so I tried to > make it more readable and consistent with our usage of `contains` throughout > the code base. > > There's also `Option<T> hashmap::get`, which if we update to return a > reference should be as efficient as `.find`? Anyway, we can discuss > separately from this review. >From a consistency point of view, I would actually argue that since we use >`find()` in some places anyways, and will not remove from these places for >performance reasons, we might as well use it everywhere. But I agree that this >isn't in the scope of this review. - Benno ----------------------------------------------------------- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/64930/#review194946 ----------------------------------------------------------- On Jan. 9, 2018, 8:06 a.m., Benjamin Mahler wrote: > > ----------------------------------------------------------- > This is an automatically generated e-mail. To reply, visit: > https://reviews.apache.org/r/64930/ > ----------------------------------------------------------- > > (Updated Jan. 9, 2018, 8:06 a.m.) > > > Review request for mesos, Benno Evers and Vinod Kone. > > > Repository: mesos > > > Description > ------- > > The logic for marking an agent unreachable in the master had two > very similar code paths that differed slightly across failover > and steady state cases. This patch uses a single code path. > > Unfortunately, some slight differences were necessary, and a > failover boolean was introduced to accomodate them. Specifically, > the failover and steady state cases expect the agent to reside > in the recovered and registered lists, respectively. > > > Diffs > ----- > > src/master/master.hpp 130f6e28cc62a8912aac66ecfbf014fe1ee444e3 > src/master/master.cpp 1610f4331225235e85f5a83e5338870cef99a5c8 > src/tests/slave_tests.cpp 159192053087e971746943a1bc17a76143a2d9af > > > Diff: https://reviews.apache.org/r/64930/diff/2/ > > > Testing > ------- > > make check > > > Thanks, > > Benjamin Mahler > >
