> On Feb. 18, 2015, 11:38 a.m., Niklas Nielsen wrote:
> > Let's get tests wired up before committing this :)
> 
> Adam B wrote:
>     Sure thing. Adding tests in my subsequent patch where we will pass the 
> master's timeout values on to the slave. Will post that very soon.
> 
> Ben Mahler wrote:
>     Can you do it in one patch? This patch in isolation looks a bit dangerous 
> per our conversation above.
>     
>     Also, please carefully consider whether your approach will be safe to do 
> in a single version. i.e. What happens when there are old slaves running 
> against a new master? And vice versa.
> 
> Adam B wrote:
>     Easy enough. Added the second patch to this one. Most of the new changes 
> are in messages.proto, slave.hpp/cpp, and partition_tests.cpp, with a few 
> lines in master.cpp to set the new message fields.
>     
>     Certainly safe to upgrade if the new flags stay at their default value. 
> Also, with new slaves talking to an old master, the master will still use the 
> defaults, hence so will the slaves. 
>     
>     But old slaves running against a new master with a longer timeout will 
> try to reregister unnecessarily early, so you may want to guarantee that 
> all/most of the slaves have been upgraded before setting these flags, 
> otherwise a large cluster suddenly waking up from a network split would see a 
> lot of unnecessary reregistration attempts. The old behavior in this scenario 
> was that the slaves would reregister after the same default timeout after the 
> network split, and the master would consider them shutdown after the same 
> timeout.
>     If that last scenario is a problem, maybe the better approach is for each 
> slave to specify its own ping timeout, and then the master can use these 
> values with each slave's SlaveObserver. Then we can just require the 
> master(s) to be upgraded first, as is typical.
> 
> Adam B wrote:
>     I'm inclined to think that we shouldn't worry too much about the unlikely 
> scenario of a network split in the middle of a rolling upgrade to 0.23 where 
> longer ping timeouts are simultaneously being applied. Procedure should be: 
> upgrade (most of) the cluster, then restart the master(s) with the new longer 
> ping timeout value. I can add a Note to upgrades.md if you like. How does 
> that sound to you @bmahler ?
>     If there are no objections, I'll rebase the patch and update the 
> configuration and upgrades docs.
> 
> Ben Mahler wrote:
>     Yeah that sounds fine to me (might be prudent to call it out in the 
> upgrade doc or the changelog).
>     
>     Mind reaching out to me directly when this is ready for more reviewing? :)

Absolutely. Thanks. :)


- Adam


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/29507/#review72992
-----------------------------------------------------------


On Feb. 19, 2015, 12:10 a.m., Adam B wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/29507/
> -----------------------------------------------------------
> 
> (Updated Feb. 19, 2015, 12:10 a.m.)
> 
> 
> Review request for mesos, Ben Mahler and Niklas Nielsen.
> 
> 
> Bugs: MESOS-2110
>     https://issues.apache.org/jira/browse/MESOS-2110
> 
> 
> Repository: mesos
> 
> 
> Description
> -------
> 
> Added new --slave_ping_timeout and --max_slave_ping_timeouts flags
> to mesos-master to supplement the DEFAULT_SLAVE_PING_TIMEOUT (15secs)
> and DEFAULT_MAX_SLAVE_PING_TIMEOUTS (5).
>    
> These can be extended if slaves are expected/allowed to be down for
> longer than a minute or two.
> 
> Slave will receive master's ping timeout in SlaveRe[re]gisteredMessage.
>   
> Beware that this affects recovery from network timeouts as well as
> actual slave node/process failover.
> 
> 
> Diffs
> -----
> 
>   src/master/constants.hpp ad3fe81 
>   src/master/constants.cpp d3d0f71 
>   src/master/flags.hpp 51a6059 
>   src/master/master.cpp f10a3cf 
>   src/messages/messages.proto 58484ae 
>   src/slave/constants.hpp 12d6e92 
>   src/slave/constants.cpp 7868bef 
>   src/slave/slave.hpp 91dae10 
>   src/slave/slave.cpp aec9525 
>   src/tests/fault_tolerance_tests.cpp efa5c57 
>   src/tests/partition_tests.cpp eb16a58 
>   src/tests/slave_recovery_tests.cpp 8210c52 
>   src/tests/slave_tests.cpp 153d9d6 
> 
> Diff: https://reviews.apache.org/r/29507/diff/
> 
> 
> Testing
> -------
> 
> Manually tested slave failover/shutdown with master using different 
> --slave_ping_timeout and --max_slave_ping_timeouts.
> Ran unit tests with shorter non-default values for ping timeouts.
> `make check` with new unit tests: ShortPingTimeoutUnreachableMaster and 
> ShortPingTimeoutUnreachableSlave
> 
> 
> Thanks,
> 
> Adam B
> 
>

Reply via email to