> On Feb. 18, 2015, 11:38 a.m., Niklas Nielsen wrote:
> > Let's get tests wired up before committing this :)
> 
> Adam B wrote:
>     Sure thing. Adding tests in my subsequent patch where we will pass the 
> master's timeout values on to the slave. Will post that very soon.
> 
> Ben Mahler wrote:
>     Can you do it in one patch? This patch in isolation looks a bit dangerous 
> per our conversation above.
>     
>     Also, please carefully consider whether your approach will be safe to do 
> in a single version. i.e. What happens when there are old slaves running 
> against a new master? And vice versa.

Easy enough. Added the second patch to this one. Most of the new changes are in 
messages.proto, slave.hpp/cpp, and partition_tests.cpp, with a few lines in 
master.cpp to set the new message fields.

Certainly safe to upgrade if the new flags stay at their default value. Also, 
with new slaves talking to an old master, the master will still use the 
defaults, hence so will the slaves. 

But old slaves running against a new master with a longer timeout will try to 
reregister unnecessarily early, so you may want to guarantee that all/most of 
the slaves have been upgraded before setting these flags, otherwise a large 
cluster suddenly waking up from a network split would see a lot of unnecessary 
reregistration attempts. The old behavior in this scenario was that the slaves 
would reregister after the same default timeout after the network split, and 
the master would consider them shutdown after the same timeout.
If that last scenario is a problem, maybe the better approach is for each slave 
to specify its own ping timeout, and then the master can use these values with 
each slave's SlaveObserver. Then we can just require the master(s) to be 
upgraded first, as is typical.


- Adam


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/29507/#review72992
-----------------------------------------------------------


On Feb. 19, 2015, 12:10 a.m., Adam B wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/29507/
> -----------------------------------------------------------
> 
> (Updated Feb. 19, 2015, 12:10 a.m.)
> 
> 
> Review request for mesos, Ben Mahler and Niklas Nielsen.
> 
> 
> Bugs: MESOS-2110
>     https://issues.apache.org/jira/browse/MESOS-2110
> 
> 
> Repository: mesos
> 
> 
> Description
> -------
> 
> Added new --slave_ping_timeout and --max_slave_ping_timeouts flags
> to mesos-master to supplement the DEFAULT_SLAVE_PING_TIMEOUT (15secs)
> and DEFAULT_MAX_SLAVE_PING_TIMEOUTS (5).
>    
> These can be extended if slaves are expected/allowed to be down for
> longer than a minute or two.
> 
> Slave will receive master's ping timeout in SlaveRe[re]gisteredMessage.
>   
> Beware that this affects recovery from network timeouts as well as
> actual slave node/process failover.
> 
> 
> Diffs
> -----
> 
>   src/master/constants.hpp ad3fe81 
>   src/master/constants.cpp d3d0f71 
>   src/master/flags.hpp 51a6059 
>   src/master/master.cpp f10a3cf 
>   src/messages/messages.proto 58484ae 
>   src/slave/constants.hpp 12d6e92 
>   src/slave/constants.cpp 7868bef 
>   src/slave/slave.hpp 91dae10 
>   src/slave/slave.cpp aec9525 
>   src/tests/fault_tolerance_tests.cpp efa5c57 
>   src/tests/partition_tests.cpp eb16a58 
>   src/tests/slave_recovery_tests.cpp 8210c52 
>   src/tests/slave_tests.cpp 153d9d6 
> 
> Diff: https://reviews.apache.org/r/29507/diff/
> 
> 
> Testing
> -------
> 
> Manually tested slave failover/shutdown with master using different 
> --slave_ping_timeout and --max_slave_ping_timeouts.
> Ran unit tests with shorter non-default values for ping timeouts.
> `make check` with new unit tests: ShortPingTimeoutUnreachableMaster and 
> ShortPingTimeoutUnreachableSlave
> 
> 
> Thanks,
> 
> Adam B
> 
>

Reply via email to