-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/29507/#review86461
-----------------------------------------------------------


Overall it's looking pretty good! I'd suggest splitting out the tests given the 
comments on the source changes are fairly minor.

Echoing Nik, it would be nice to have some sanity bound checks for this flag, 
to prevent operators from burning themselves.


src/master/master.cpp
<https://reviews.apache.org/r/29507/#comment138489>

    I'm guessing the snake_case came from the flag names, but this should use 
camelCase.



src/master/master.cpp
<https://reviews.apache.org/r/29507/#comment138524>

    Duration exposes .secs() for this purpose:
    
    s/Seconds(pingTimeout).value()/pingTimeout.secs()/
    
    Can you update it here and in the other cases below?



src/messages/messages.proto
<https://reviews.apache.org/r/29507/#comment138491>

    The defaults here seem to suggest that we will rely on them (i.e. we don't 
check if the field is set, rather we always grab the value). However, 0 is 
going to be a problematic value! I'd suggest removing the defaults, and rather, 
adding a comment here for posterity that indicates that these were added in 
0.23.0.



src/slave/slave.hpp
<https://reviews.apache.org/r/29507/#comment138493>

    Ditto Nik's comment for camelCase, but also, could you update these to use 
`Duration` rather than directly using a `double`?



src/slave/slave.cpp
<https://reviews.apache.org/r/29507/#comment138520>

    Could you move this up and do it once before the switch?
    
    I see now why you probably wanted the explicit default, it's too bad we 
don't have support to bind in `Option`s for optional fields in install 
handlers. How about we rely on the implicit default: 
https://developers.google.com/protocol-buffers/docs/proto#optional
    
    Alternatively, you could have a nested message that contains the fields 
(e.g. 'Connection') which allows you to explicitly check for 
'`has_ping_timeout_seconds`').



src/slave/slave.cpp
<https://reviews.apache.org/r/29507/#comment138522>

    "once re-registered"?



src/tests/partition_tests.cpp
<https://reviews.apache.org/r/29507/#comment138527>

    I'm a bit confused by this test, why are you expecting the first ping? Why 
are you dropping the pongs?
    
    I would expect the pings to be dropped entirely here so that advancing the 
clock after a slave is registered leads to a re-registration.
    
    Also, did you know about SlaveTest.PingTimeoutNoPings and 
SlaveTest.PingTimeoutSomePings tests? Is there something being tested here that 
is not captured in these?



src/tests/partition_tests.cpp
<https://reviews.apache.org/r/29507/#comment138525>

    Can you do a sweep and replace all of these Seconds(duration).value() cases 
with duration.secs()?



src/tests/partition_tests.cpp
<https://reviews.apache.org/r/29507/#comment138529>

    This sounds like testing two particular behaviors in a single test:
    
    (1) A registered slave that never receives any Ping messages will attempt 
to reregister after the master's ping timeout.
    
    (2) Master will consider it shutdown after the same timeout.
    
    I believe we already have tests for both of these. For (1) we could just 
add the shutdown message expectation to `PartitionTest.PartitionedSlave`. For 
(2), we have SlaveTest.PingTimeoutNoPings, SlaveTest.PingTimeoutNoPings.


- Ben Mahler


On May 28, 2015, 11:13 p.m., Adam B wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/29507/
> -----------------------------------------------------------
> 
> (Updated May 28, 2015, 11:13 p.m.)
> 
> 
> Review request for mesos, Ben Mahler and Niklas Nielsen.
> 
> 
> Bugs: MESOS-2110
>     https://issues.apache.org/jira/browse/MESOS-2110
> 
> 
> Repository: mesos
> 
> 
> Description
> -------
> 
> Added new --slave_ping_timeout and --max_slave_ping_timeouts flags
> to mesos-master to supplement the DEFAULT_SLAVE_PING_TIMEOUT (15secs)
> and DEFAULT_MAX_SLAVE_PING_TIMEOUTS (5).
>    
> These can be extended if slaves are expected/allowed to be down for
> longer than a minute or two.
> 
> Slave will receive master's ping timeout in SlaveRe[re]gisteredMessage.
>   
> Beware that this affects recovery from network timeouts as well as
> actual slave node/process failover.
> 
> Also fixed the log message in recoveredSlavesTimeout() to correctly
> reference flags.slave_reregister_timeout instead of the unrelated
> ping timeouts.
> 
> 
> Diffs
> -----
> 
>   docs/configuration.md 5a41477 
>   docs/upgrades.md 355307a 
>   src/master/constants.hpp 57cf8fb 
>   src/master/constants.cpp 8c7174a 
>   src/master/flags.hpp 84fa238 
>   src/master/flags.cpp 49d953a 
>   src/master/master.cpp 1526f59 
>   src/messages/messages.proto 39dac72 
>   src/slave/constants.hpp 206d439 
>   src/slave/constants.cpp d8d2f98 
>   src/slave/slave.hpp 0207eaf 
>   src/slave/slave.cpp b4d2029 
>   src/tests/partition_tests.cpp f7ee3ab 
>   src/tests/slave_recovery_tests.cpp c036e9c 
>   src/tests/slave_tests.cpp acae497 
> 
> Diff: https://reviews.apache.org/r/29507/diff/
> 
> 
> Testing
> -------
> 
> Manually tested slave failover/shutdown with master using different 
> --slave_ping_timeout and --max_slave_ping_timeouts.
> Ran unit tests with shorter non-default values for ping timeouts.
> `make check` with new unit tests: ShortPingTimeoutUnreachableMaster and 
> ShortPingTimeoutUnreachableSlave
> 
> 
> Thanks,
> 
> Adam B
> 
>

Reply via email to