> On Feb. 18, 2015, 7:38 p.m., Niklas Nielsen wrote:
> > Let's get tests wired up before committing this :)
> 
> Adam B wrote:
>     Sure thing. Adding tests in my subsequent patch where we will pass the 
> master's timeout values on to the slave. Will post that very soon.
> 
> Ben Mahler wrote:
>     Can you do it in one patch? This patch in isolation looks a bit dangerous 
> per our conversation above.
>     
>     Also, please carefully consider whether your approach will be safe to do 
> in a single version. i.e. What happens when there are old slaves running 
> against a new master? And vice versa.
> 
> Adam B wrote:
>     Easy enough. Added the second patch to this one. Most of the new changes 
> are in messages.proto, slave.hpp/cpp, and partition_tests.cpp, with a few 
> lines in master.cpp to set the new message fields.
>     
>     Certainly safe to upgrade if the new flags stay at their default value. 
> Also, with new slaves talking to an old master, the master will still use the 
> defaults, hence so will the slaves. 
>     
>     But old slaves running against a new master with a longer timeout will 
> try to reregister unnecessarily early, so you may want to guarantee that 
> all/most of the slaves have been upgraded before setting these flags, 
> otherwise a large cluster suddenly waking up from a network split would see a 
> lot of unnecessary reregistration attempts. The old behavior in this scenario 
> was that the slaves would reregister after the same default timeout after the 
> network split, and the master would consider them shutdown after the same 
> timeout.
>     If that last scenario is a problem, maybe the better approach is for each 
> slave to specify its own ping timeout, and then the master can use these 
> values with each slave's SlaveObserver. Then we can just require the 
> master(s) to be upgraded first, as is typical.
> 
> Adam B wrote:
>     I'm inclined to think that we shouldn't worry too much about the unlikely 
> scenario of a network split in the middle of a rolling upgrade to 0.23 where 
> longer ping timeouts are simultaneously being applied. Procedure should be: 
> upgrade (most of) the cluster, then restart the master(s) with the new longer 
> ping timeout value. I can add a Note to upgrades.md if you like. How does 
> that sound to you @bmahler ?
>     If there are no objections, I'll rebase the patch and update the 
> configuration and upgrades docs.

Yeah that sounds fine to me (might be prudent to call it out in the upgrade doc 
or the changelog).

Mind reaching out to me directly when this is ready for more reviewing? :)


- Ben


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/29507/#review72992
-----------------------------------------------------------


On Feb. 19, 2015, 8:10 a.m., Adam B wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/29507/
> -----------------------------------------------------------
> 
> (Updated Feb. 19, 2015, 8:10 a.m.)
> 
> 
> Review request for mesos, Ben Mahler and Niklas Nielsen.
> 
> 
> Bugs: MESOS-2110
>     https://issues.apache.org/jira/browse/MESOS-2110
> 
> 
> Repository: mesos
> 
> 
> Description
> -------
> 
> Added new --slave_ping_timeout and --max_slave_ping_timeouts flags
> to mesos-master to supplement the DEFAULT_SLAVE_PING_TIMEOUT (15secs)
> and DEFAULT_MAX_SLAVE_PING_TIMEOUTS (5).
>    
> These can be extended if slaves are expected/allowed to be down for
> longer than a minute or two.
> 
> Slave will receive master's ping timeout in SlaveRe[re]gisteredMessage.
>   
> Beware that this affects recovery from network timeouts as well as
> actual slave node/process failover.
> 
> 
> Diffs
> -----
> 
>   src/master/constants.hpp ad3fe81 
>   src/master/constants.cpp d3d0f71 
>   src/master/flags.hpp 51a6059 
>   src/master/master.cpp f10a3cf 
>   src/messages/messages.proto 58484ae 
>   src/slave/constants.hpp 12d6e92 
>   src/slave/constants.cpp 7868bef 
>   src/slave/slave.hpp 91dae10 
>   src/slave/slave.cpp aec9525 
>   src/tests/fault_tolerance_tests.cpp efa5c57 
>   src/tests/partition_tests.cpp eb16a58 
>   src/tests/slave_recovery_tests.cpp 8210c52 
>   src/tests/slave_tests.cpp 153d9d6 
> 
> Diff: https://reviews.apache.org/r/29507/diff/
> 
> 
> Testing
> -------
> 
> Manually tested slave failover/shutdown with master using different 
> --slave_ping_timeout and --max_slave_ping_timeouts.
> Ran unit tests with shorter non-default values for ping timeouts.
> `make check` with new unit tests: ShortPingTimeoutUnreachableMaster and 
> ShortPingTimeoutUnreachableSlave
> 
> 
> Thanks,
> 
> Adam B
> 
>

Reply via email to