Re: Review Request 55307: Improved handling of agents that restart but never re-register.

Vinod Kone Sun, 08 Jan 2017 19:18:30 -0800

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/55307/#review160859
-----------------------------------------------------------





src/master/flags.cpp (line 127)
<https://reviews.apache.org/r/55307/#comment232086>

    s/each/an/ ?



src/master/master.cpp (line 1333)
<https://reviews.apache.org/r/55307/#comment232088>

    s/limiter/limited/



src/master/master.cpp (line 1339)
<https://reviews.apache.org/r/55307/#comment232089>

    can you do "<< *slave" here?
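
For illustration, a minimal sketch of what streaming the agent object directly could look like. The `Slave` struct, its fields, the output format, and the `describe` helper are hypothetical stand-ins, not the actual definitions in `src/master/master.hpp`:

```cpp
#include <cassert>
#include <ostream>
#include <sstream>
#include <string>

// Hypothetical stand-in for the master's per-agent record.
struct Slave {
  std::string id;
  std::string pid;
  std::string hostname;
};

// One stream operator keeps the log format for agents in a single place.
// The exact format here is an assumption made for this sketch.
inline std::ostream& operator<<(std::ostream& stream, const Slave& slave) {
  return stream << slave.id << " at " << slave.pid
                << " (" << slave.hostname << ")";
}

// Hypothetical helper: builds a log line by streaming `*slave` rather
// than printing each field by hand at the call site.
std::string describe(const Slave* slave) {
  std::ostringstream out;
  out << "Agent " << *slave << " did not re-register";
  return out.str();
}
```

With an operator like this, every log statement that mentions an agent prints the same fields in the same order.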



src/master/master.cpp (lines 1368 - 1370)
<https://reviews.apache.org/r/55307/#comment232090>

    Just like "health check time out", can this be succinct? Maybe "re-registration time out"?



src/slave/slave.cpp (line 4282)
<https://reviews.apache.org/r/55307/#comment232091>

    I think this is worth calling out in the CHANGELOG because it is a significant change in behavior.



src/tests/master_tests.cpp (lines 6105 - 6117)
<https://reviews.apache.org/r/55307/#comment232092>

    Instead of advancing the clock so that recovery finishes, why not just leave the clock alone? That way you don't have to drop the registration messages either.
    
    Also, you only call `detector.appoint` much further down; does the agent even send re-registration messages without it here? I guess it does if the master pid didn't change?



src/tests/master_tests.cpp (lines 6270 - 6276)
<https://reviews.apache.org/r/55307/#comment232095>

    see above.


- Vinod Kone


On Jan. 7, 2017, 9:13 p.m., Neil Conway wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/55307/
> -----------------------------------------------------------
> 
> (Updated Jan. 7, 2017, 9:13 p.m.)
> 
> 
> Review request for mesos and Vinod Kone.
> 
> 
> Bugs: MESOS-6286
>     https://issues.apache.org/jira/browse/MESOS-6286
> 
> 
> Repository: mesos
> 
> 
> Description
> -------
> 
> The master expected that if an agent responds to pings, it will
> (eventually) register or re-register. However, if the agent hangs during
> recovery, that assumption does not hold: the agent will continue to
> respond to pings but won't attempt to re-register until recovery
> finishes.
> 
> To handle this case, the master now expects an agent to re-register
> within `agent_reregister_timeout` if the master -> agent socket breaks;
> if no re-registration is seen, the master will mark the agent
> unreachable. This is a "backup" to handle the case where recovery hangs,
> as explained above; more commonly, the agent will re-register (when it
> receives a ping and notices the master believes it is disconnected) or
> be marked unreachable because it fails to respond to pings.
> 
> 
> Diffs
> -----
> 
>   docs/configuration.md e4beb2d5a72f1c5f59b2e40f4984cc60b8437d9d 
>   src/master/flags.cpp 737290a42c532f2349009d0a451ce271d6f107b9 
>   src/master/master.hpp 57fc6e6f2995078df80f0aa52707727db802ede0 
>   src/master/master.cpp 11c34a048586d30c6ac67be8638ed8fa81cc3f1f 
>   src/slave/slave.cpp f8f2ccfadb9a00be17c0b552586aa5875b7cbb19 
>   src/tests/master_tests.cpp 1cf4c92b2474e18771459f877b2f3c49077e8a01 
>   src/tests/slave_tests.cpp d633a74d6b342239fbca0b44eec281eb315df5ff 
> 
> Diff: https://reviews.apache.org/r/55307/diff/
> 
> 
> Testing
> -------
> 
> `make check`
> 
> Ran new tests a few thousand times on OS X and a Linux VM to check for flakiness.
> 
> 
> Thanks,
> 
> Neil Conway
> 
>
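
The re-registration timeout described in the quoted description can be sketched as a small standalone model. This is only an illustration under stated assumptions: `MasterModel`, `AgentState`, and every method name are hypothetical, time is passed in explicitly, and none of this is the actual code in `src/master/master.cpp`. The only name taken from the review is the `agent_reregister_timeout` flag, modeled here as the constructor argument.

```cpp
#include <cassert>
#include <chrono>
#include <map>
#include <string>
#include <vector>

using Seconds = std::chrono::seconds;

// Hypothetical per-agent bookkeeping for this sketch.
struct AgentState {
  bool connected = true;
  bool unreachable = false;
  Seconds disconnectedAt{0};
};

// Hypothetical, simplified model of the master-side timeout.
class MasterModel {
public:
  // `reregisterTimeout` plays the role of `agent_reregister_timeout`.
  explicit MasterModel(Seconds reregisterTimeout)
    : timeout(reregisterTimeout) {}

  void addAgent(const std::string& id) { agents[id] = AgentState(); }

  // The master -> agent socket broke at time `now`.
  void onSocketBroken(const std::string& id, Seconds now) {
    AgentState& agent = agents.at(id);
    agent.connected = false;
    agent.disconnectedAt = now;
  }

  // The agent re-registered, so it is connected again.
  void onReregistered(const std::string& id) {
    AgentState& agent = agents.at(id);
    agent.connected = true;
    agent.unreachable = false;
  }

  // Marks every agent that has been disconnected for at least the
  // timeout as unreachable; returns the ids marked by this call.
  std::vector<std::string> tick(Seconds now) {
    std::vector<std::string> marked;
    for (auto& entry : agents) {
      AgentState& agent = entry.second;
      if (!agent.connected && !agent.unreachable &&
          now - agent.disconnectedAt >= timeout) {
        agent.unreachable = true;
        marked.push_back(entry.first);
      }
    }
    return marked;
  }

  bool isUnreachable(const std::string& id) const {
    return agents.at(id).unreachable;
  }

private:
  Seconds timeout;
  std::map<std::string, AgentState> agents;
};
```

In this model an agent that hangs during recovery never calls `onReregistered`, so `tick` marks it unreachable once the timeout has elapsed since the socket broke, even though nothing here depends on whether it still answers pings.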
