> On Feb. 18, 2015, 11:38 a.m., Niklas Nielsen wrote: > > Let's get tests wired up before committing this :) > > Adam B wrote: > Sure thing. Adding tests in my subsequent patch where we will pass the > master's timeout values on to the slave. Will post that very soon. > > Ben Mahler wrote: > Can you do it in one patch? This patch in isolation looks a bit dangerous > per our conversation above. > > Also, please carefully consider whether your approach will be safe to do > in a single version. i.e. What happens when there are old slaves running > against a new master? And vice versa. > > Adam B wrote: > Easy enough. Added the second patch to this one. Most of the new changes > are in messages.proto, slave.hpp/cpp, and partition_tests.cpp, with a few > lines in master.cpp to set the new message fields. > > Certainly safe to upgrade if the new flags stay at their default value. > Also, with new slaves talking to an old master, the master will still use the > defaults, hence so will the slaves. > > But old slaves running against a new master with a longer timeout will > try to reregister unnecessarily early, so you may want to guarantee that > all/most of the slaves have been upgraded before setting these flags, > otherwise a large cluster suddenly waking up from a network split would see a > lot of unnecessary reregistration attempts. The old behavior in this scenario > was that the slaves would reregister after the same default timeout after the > network split, and the master would consider them shutdown after the same > timeout. > If that last scenario is a problem, maybe the better approach is for each > slave to specify its own ping timeout, and then the master can use these > values with each slave's SlaveObserver. Then we can just require the > master(s) to be upgraded first, as is typical. > > Adam B wrote: > I'm inclined to think that we shouldn't worry too much about the unlikely > scenario of a network split in the middle of a rolling upgrade to 0.23 where > longer ping timeouts are simultaneously being applied. Procedure should be: > upgrade (most of) the cluster, then restart the master(s) with the new longer > ping timeout value. I can add a Note to upgrades.md if you like. How does > that sound to you @bmahler ? > If there are no objections, I'll rebase the patch and update the > configuration and upgrades docs. > > Ben Mahler wrote: > Yeah that sounds fine to me (might be prudent to call it out in the > upgrade doc or the changelog). > > Mind reaching out to me directly when this is ready for more reviewing? :)
Absolutely. Thanks. :) - Adam ----------------------------------------------------------- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/29507/#review72992 ----------------------------------------------------------- On Feb. 19, 2015, 12:10 a.m., Adam B wrote: > > ----------------------------------------------------------- > This is an automatically generated e-mail. To reply, visit: > https://reviews.apache.org/r/29507/ > ----------------------------------------------------------- > > (Updated Feb. 19, 2015, 12:10 a.m.) > > > Review request for mesos, Ben Mahler and Niklas Nielsen. > > > Bugs: MESOS-2110 > https://issues.apache.org/jira/browse/MESOS-2110 > > > Repository: mesos > > > Description > ------- > > Added new --slave_ping_timeout and --max_slave_ping_timeouts flags > to mesos-master to supplement the DEFAULT_SLAVE_PING_TIMEOUT (15secs) > and DEFAULT_MAX_SLAVE_PING_TIMEOUTS (5). > > These can be extended if slaves are expected/allowed to be down for > longer than a minute or two. > > Slave will receive master's ping timeout in SlaveRe[re]gisteredMessage. > > Beware that this affects recovery from network timeouts as well as > actual slave node/process failover. > > > Diffs > ----- > > src/master/constants.hpp ad3fe81 > src/master/constants.cpp d3d0f71 > src/master/flags.hpp 51a6059 > src/master/master.cpp f10a3cf > src/messages/messages.proto 58484ae > src/slave/constants.hpp 12d6e92 > src/slave/constants.cpp 7868bef > src/slave/slave.hpp 91dae10 > src/slave/slave.cpp aec9525 > src/tests/fault_tolerance_tests.cpp efa5c57 > src/tests/partition_tests.cpp eb16a58 > src/tests/slave_recovery_tests.cpp 8210c52 > src/tests/slave_tests.cpp 153d9d6 > > Diff: https://reviews.apache.org/r/29507/diff/ > > > Testing > ------- > > Manually tested slave failover/shutdown with master using different > --slave_ping_timeout and --max_slave_ping_timeouts. > Ran unit tests with shorter non-default values for ping timeouts. > `make check` with new unit tests: ShortPingTimeoutUnreachableMaster and > ShortPingTimeoutUnreachableSlave > > > Thanks, > > Adam B > >