----------------------------------------------------------- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/29507/#review86461 -----------------------------------------------------------
Overall it's looking pretty good! I'd suggest splitting out the tests given the comments on the source changes are fairly minor. Echoing Nik, it would be nice to have some sanity bound checks for this flag, to prevent operators from burning themselves. src/master/master.cpp <https://reviews.apache.org/r/29507/#comment138489> I'm guessing the snake_case came from the flag names, but this should use camelCase. src/master/master.cpp <https://reviews.apache.org/r/29507/#comment138524> Duration exposes .secs() for this purpose: s/Seconds(pingTimeout).value()/pingTimeout.secs()/ Can you update it here and in the other cases below? src/messages/messages.proto <https://reviews.apache.org/r/29507/#comment138491> The defaults here seem to suggest that we will rely on them (i.e. we don't check if the field is set, rather we always grab the value). However, 0 is going to be a problematic value! I'd suggest removing the defaults, and rather, adding a comment here for posterity that indicates that these were added in 0.23.0. src/slave/slave.hpp <https://reviews.apache.org/r/29507/#comment138493> Ditto Nik's comment for camelCase, but also, could you update these to use `Duration` rather than directly using a `double`? src/slave/slave.cpp <https://reviews.apache.org/r/29507/#comment138520> Could you move this up and do it once before the switch? I see now why you probably wanted the explicit default, it's too bad we don't have support to bind in `Option`s for optional fields in install handlers. How about we rely on the implicit default: https://developers.google.com/protocol-buffers/docs/proto#optional Alternatively, you could have a nested message that contains the fields (e.g. 'Connection') which allows you to explicitly check for '`has_ping_timeout_seconds`'). src/slave/slave.cpp <https://reviews.apache.org/r/29507/#comment138522> "once re-registered"? src/tests/partition_tests.cpp <https://reviews.apache.org/r/29507/#comment138527> I'm a bit confused by this test, why are you expecting the first ping? Why are you dropping the pongs? I would expect the pings to be dropped entirely here so that advancing the clock after a slave is registered leads to a re-registration. Also, did you know about SlaveTest.PingTimeoutNoPings and SlaveTest.PingTimeoutSomePings tests? Is there something being tested here that is not captured in these? src/tests/partition_tests.cpp <https://reviews.apache.org/r/29507/#comment138525> Can you do a sweep and replace all of these Seconds(duration).value() cases with duration.secs()? src/tests/partition_tests.cpp <https://reviews.apache.org/r/29507/#comment138529> This sounds like testing two particular behaviors in a single test: (1) A registered slave that never receives any Ping messages will attempt to reregister after the master's ping timeout. (2) Master will consider it shutdown after the same timeout. I believe we already have tests for both of these. For (1) we could just add the shutdown message expectation to `PartitionTest.PartitionedSlave`. For (2), we have SlaveTest.PingTimeoutNoPings, SlaveTest.PingTimeoutNoPings. - Ben Mahler On May 28, 2015, 11:13 p.m., Adam B wrote: > > ----------------------------------------------------------- > This is an automatically generated e-mail. To reply, visit: > https://reviews.apache.org/r/29507/ > ----------------------------------------------------------- > > (Updated May 28, 2015, 11:13 p.m.) > > > Review request for mesos, Ben Mahler and Niklas Nielsen. > > > Bugs: MESOS-2110 > https://issues.apache.org/jira/browse/MESOS-2110 > > > Repository: mesos > > > Description > ------- > > Added new --slave_ping_timeout and --max_slave_ping_timeouts flags > to mesos-master to supplement the DEFAULT_SLAVE_PING_TIMEOUT (15secs) > and DEFAULT_MAX_SLAVE_PING_TIMEOUTS (5). > > These can be extended if slaves are expected/allowed to be down for > longer than a minute or two. > > Slave will receive master's ping timeout in SlaveRe[re]gisteredMessage. > > Beware that this affects recovery from network timeouts as well as > actual slave node/process failover. > > Also fixed the log message in recoveredSlavesTimeout() to correctly > reference flags.slave_reregister_timeout instead of the unrelated > ping timeouts. > > > Diffs > ----- > > docs/configuration.md 5a41477 > docs/upgrades.md 355307a > src/master/constants.hpp 57cf8fb > src/master/constants.cpp 8c7174a > src/master/flags.hpp 84fa238 > src/master/flags.cpp 49d953a > src/master/master.cpp 1526f59 > src/messages/messages.proto 39dac72 > src/slave/constants.hpp 206d439 > src/slave/constants.cpp d8d2f98 > src/slave/slave.hpp 0207eaf > src/slave/slave.cpp b4d2029 > src/tests/partition_tests.cpp f7ee3ab > src/tests/slave_recovery_tests.cpp c036e9c > src/tests/slave_tests.cpp acae497 > > Diff: https://reviews.apache.org/r/29507/diff/ > > > Testing > ------- > > Manually tested slave failover/shutdown with master using different > --slave_ping_timeout and --max_slave_ping_timeouts. > Ran unit tests with shorter non-default values for ping timeouts. > `make check` with new unit tests: ShortPingTimeoutUnreachableMaster and > ShortPingTimeoutUnreachableSlave > > > Thanks, > > Adam B > >