Re: Review Request 64930: Unified the marking agent unreachable logic in the master.

Benjamin Mahler Tue, 09 Jan 2018 00:07:42 -0800


> On Jan. 8, 2018, 5:31 p.m., Benno Evers wrote:
> > The structure of the new `markUnreachable()` seems to be `[recover slave 
> > info depending on bool parameter] -> common code path to apply registry 
> > operation -> [notification/cleanup depending on bool parameter]`. 
> > Intuitively I'd try to get rid of the bool parameter and only put the 
> > shared part into `markUnreachable()`, but on the other hand the code might 
> > get even more complicated when you attempt to do this, so it should be fine 
> > to leave the general structure as is.

Thanks for the review!

Yes, it's a little unfortunate, I tried to unify as much as possible (you'll 
notice that it now consistently checks against agent states, for example). I 
initially considered getting rid of the boolean by inferring the state from 
whether the agent is in `recovered` or `registered`, however since it's 
asynchronously called it seems like there may be some risky cases (e.g. agent 
moved from `recovered` to `registered` in the interim, so it should be 
canceled, but we infer we're marking a registered agent as unreachable!).

My thinking was that these can potentially go away longer term if we treat 
recovered agents more consistently vs steady state agents by storing a `Slave*` 
for them (similar to how we now store a `Framework*` for frameworks recovered 
from agent information).

Let me know if you have any thoughts!

> On Jan. 8, 2018, 5:31 p.m., Benno Evers wrote:
> > src/master/master.hpp
> > Lines 512 (patched)
> > <https://reviews.apache.org/r/64930/diff/1/?file=1930013#file1930013line512>
> >
> >     Whats actually the reasoning behind crashing the master instead of 
> > returning a failed future?

Currently any registry failure is fatal: if we cannot persist any information, 
we can't make progress on agent lifecycle events, configuration updates, etc. 
Rather than trying to ensure that each caller handles the future failure in a 
consistent way, I opted to just consistently abort within this function.

Thoughts?

> On Jan. 8, 2018, 5:31 p.m., Benno Evers wrote:
> > src/master/master.cpp
> > Line 2024 (original), 2025 (patched)
> > <https://reviews.apache.org/r/64930/diff/1/?file=1930014#file1930014line2026>
> >
> >     Double space after 'within'.

Thanks for catching these!

> On Jan. 8, 2018, 5:31 p.m., Benno Evers wrote:
> > src/master/master.cpp
> > Line 8277 (original), 8218 (patched)
> > <https://reviews.apache.org/r/64930/diff/1/?file=1930014#file1930014line8307>
> >
> >     Do we need an `onDiscarded()` handler on an `undiscardable()` future?

The undiscardable wrapper just prevents the discard request from propagating 
inward / breaking any .then chains. It's still possible for there to be a 
discarded future returned, so I felt it was safer to handle it.

> On Jan. 8, 2018, 5:31 p.m., Benno Evers wrote:
> > src/master/master.cpp
> > Line 8304 (original), 8235 (patched)
> > <https://reviews.apache.org/r/64930/diff/1/?file=1930014#file1930014line8336>
> >
> >     After a similar a change I got the following comment from @dzhuk, which 
> > I'll relay here because I think it's justified:
> >     
> >     > I think `CHECK` should be used to validate invariants that we have 
> > from some other context. but error here is something that can happen during 
> > normal operation. and the original version would produce a meaningful 
> > message in log

Hm.. I think the current invariant is actually that these should always succeed 
given the way we apply registry operations in the master, is that not the case? 
Do you have a concrete suggestion for a follow up? Are you suggesting logging 
and continuing? Probably warrants a more subtantial conversation outside of 
this review.

> On Jan. 8, 2018, 5:31 p.m., Benno Evers wrote:
> > src/master/master.cpp
> > Lines 8251 (patched)
> > <https://reviews.apache.org/r/64930/diff/1/?file=1930014#file1930014line8352>
> >
> >     Not sure what our coding style says about this, but we should think 
> > about preferring `.find()` over `.contains()` to avoid the double lookup. 
> > (which gcc  cannot optimize away even at `-O3`, according to some quick 
> > local testing)

We tend to avoid `.find` since it results in less readable code (` == _.end()` 
and iterator indirection), but we have made the optimization in some 
performance critical code paths. This doesn't seem to be one so I tried to make 
it more readable and consistent with our usage of `contains` throughout the 
code base.

There's also `Option<T> hashmap::get`, which if we update to return a reference 
should be as efficient as `.find`? Anyway, we can discuss separately from this 
review.

- Benjamin

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/64930/#review194946
-----------------------------------------------------------

On Jan. 3, 2018, 10:48 p.m., Benjamin Mahler wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/64930/
> -----------------------------------------------------------
> 
> (Updated Jan. 3, 2018, 10:48 p.m.)
> 
> 
> Review request for mesos, Benno Evers and Vinod Kone.
> 
> 
> Repository: mesos
> 
> 
> Description
> -------
> 
> The logic for marking an agent unreachable in the master had two
> very similar code paths that differed slightly across failover
> and steady state cases. This patch uses a single code path.
> 
> Unfortunately, some slight differences were necessary, and a
> failover boolean was introduced to accomodate them. Specifically,
> the failover and steady state cases expect the agent to reside
> in the recovered and registered lists, respectively.
> 
> 
> Diffs
> -----
> 
>   src/master/master.hpp 8fe9420dbe03ea2cefc6a40b0f64284aa9fe7915 
>   src/master/master.cpp 03eb178fa1af7d55ae387e6cb42cdc8d721a2196 
>   src/tests/slave_tests.cpp fb49077d7cb81db450d9fa24f75dbd9c79ef645c 
> 
> 
> Diff: https://reviews.apache.org/r/64930/diff/1/
> 
> 
> Testing
> -------
> 
> make check
> 
> 
> Thanks,
> 
> Benjamin Mahler
> 
>

Re: Review Request 64930: Unified the marking agent unreachable logic in the master.

Reply via email to