> On Feb. 18, 2015, 11:38 a.m., Niklas Nielsen wrote: > > Let's get tests wired up before committing this :) > > Adam B wrote: > Sure thing. Adding tests in my subsequent patch where we will pass the > master's timeout values on to the slave. Will post that very soon. > > Ben Mahler wrote: > Can you do it in one patch? This patch in isolation looks a bit dangerous > per our conversation above. > > Also, please carefully consider whether your approach will be safe to do > in a single version. i.e. What happens when there are old slaves running > against a new master? And vice versa.
Easy enough. Added the second patch to this one. Most of the new changes are in messages.proto, slave.hpp/cpp, and partition_tests.cpp, with a few lines in master.cpp to set the new message fields. Certainly safe to upgrade if the new flags stay at their default value. Also, with new slaves talking to an old master, the master will still use the defaults, hence so will the slaves. But old slaves running against a new master with a longer timeout will try to reregister unnecessarily early, so you may want to guarantee that all/most of the slaves have been upgraded before setting these flags, otherwise a large cluster suddenly waking up from a network split would see a lot of unnecessary reregistration attempts. The old behavior in this scenario was that the slaves would reregister after the same default timeout after the network split, and the master would consider them shutdown after the same timeout. If that last scenario is a problem, maybe the better approach is for each slave to specify its own ping timeout, and then the master can use these values with each slave's SlaveObserver. Then we can just require the master(s) to be upgraded first, as is typical. - Adam ----------------------------------------------------------- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/29507/#review72992 ----------------------------------------------------------- On Feb. 19, 2015, 12:10 a.m., Adam B wrote: > > ----------------------------------------------------------- > This is an automatically generated e-mail. To reply, visit: > https://reviews.apache.org/r/29507/ > ----------------------------------------------------------- > > (Updated Feb. 19, 2015, 12:10 a.m.) > > > Review request for mesos, Ben Mahler and Niklas Nielsen. > > > Bugs: MESOS-2110 > https://issues.apache.org/jira/browse/MESOS-2110 > > > Repository: mesos > > > Description > ------- > > Added new --slave_ping_timeout and --max_slave_ping_timeouts flags > to mesos-master to supplement the DEFAULT_SLAVE_PING_TIMEOUT (15secs) > and DEFAULT_MAX_SLAVE_PING_TIMEOUTS (5). > > These can be extended if slaves are expected/allowed to be down for > longer than a minute or two. > > Slave will receive master's ping timeout in SlaveRe[re]gisteredMessage. > > Beware that this affects recovery from network timeouts as well as > actual slave node/process failover. > > > Diffs > ----- > > src/master/constants.hpp ad3fe81 > src/master/constants.cpp d3d0f71 > src/master/flags.hpp 51a6059 > src/master/master.cpp f10a3cf > src/messages/messages.proto 58484ae > src/slave/constants.hpp 12d6e92 > src/slave/constants.cpp 7868bef > src/slave/slave.hpp 91dae10 > src/slave/slave.cpp aec9525 > src/tests/fault_tolerance_tests.cpp efa5c57 > src/tests/partition_tests.cpp eb16a58 > src/tests/slave_recovery_tests.cpp 8210c52 > src/tests/slave_tests.cpp 153d9d6 > > Diff: https://reviews.apache.org/r/29507/diff/ > > > Testing > ------- > > Manually tested slave failover/shutdown with master using different > --slave_ping_timeout and --max_slave_ping_timeouts. > Ran unit tests with shorter non-default values for ping timeouts. > `make check` with new unit tests: ShortPingTimeoutUnreachableMaster and > ShortPingTimeoutUnreachableSlave > > > Thanks, > > Adam B > >