> On June 24, 2015, 6:41 p.m., Ben Mahler wrote:
> > Actually, we should think about one more thing, how does this interact with 
> > the zookeeper session timeout?

The hardcoded individual ping timeout (15secs) was previously longer than the 
default zk session timeout (10secs), but the zk session timeout is already 
configurable with no bounds/validation, so users could already change it to be 
longer than an individual ping timeout, or even the total ping timeout. Is this 
a bad idea? What would this mean?
This is mostly relevant to the slave-side (as a leader detector), in that a 
slave may timeout and decide to check for a new leading master to reregister 
with, but the zk session may have recently gone bad but not yet timed out. 
Since a zk session could go bad/timeout at any point along the way, I don't 
think making the ping timeout configurable will introduce any new potential 
errors. Rather, it would just be advisable to keep the zk session timeouts 
reasonably small so that the zk session is likely to be healthy whenever a 
slave needs to detect a new leader. We could introduce validation that the 
zk_session_timeout is shorter than the (total) ping timeout, but I'm not sure 
that's even necessary.
Does this make sense? I'm no ZK expert, so I defer to those with more 
experience.


- Adam


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/29507/#review89305
-----------------------------------------------------------


On June 26, 2015, 3:12 a.m., Adam B wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/29507/
> -----------------------------------------------------------
> 
> (Updated June 26, 2015, 3:12 a.m.)
> 
> 
> Review request for mesos, Ben Mahler and Niklas Nielsen.
> 
> 
> Bugs: MESOS-2110
>     https://issues.apache.org/jira/browse/MESOS-2110
> 
> 
> Repository: mesos
> 
> 
> Description
> -------
> 
> Added new --slave_ping_timeout and --max_slave_ping_timeouts flags
> to mesos-master to supplement the DEFAULT_SLAVE_PING_TIMEOUT (15secs)
> and DEFAULT_MAX_SLAVE_PING_TIMEOUTS (5).
>    
> These can be extended if slaves are expected/allowed to be down for
> longer than a minute or two.
> 
> Slave will receive master's ping timeout in SlaveRe[re]gisteredMessage.
>   
> Beware that this affects recovery from network timeouts as well as
> actual slave node/process failover.
> 
> Also fixed the log message in recoveredSlavesTimeout() to correctly
> reference flags.slave_reregister_timeout instead of the unrelated
> ping timeouts.
> 
> 
> Diffs
> -----
> 
>   docs/configuration.md aaf65bf 
>   docs/upgrades.md 73e503c 
>   src/master/constants.hpp 072d59c 
>   src/master/constants.cpp 997b792 
>   src/master/flags.hpp 55ed3a9 
>   src/master/flags.cpp 4377715 
>   src/master/master.cpp 0782b54 
>   src/messages/messages.proto 1c8d79e 
>   src/slave/constants.hpp 84927e5 
>   src/slave/constants.cpp d8d2f98 
>   src/slave/slave.hpp f1cf3b8 
>   src/slave/slave.cpp b3e1ccc 
>   src/tests/partition_tests.cpp f7ee3ab 
>   src/tests/slave_recovery_tests.cpp c036e9c 
>   src/tests/slave_tests.cpp e9002e8 
> 
> Diff: https://reviews.apache.org/r/29507/diff/
> 
> 
> Testing
> -------
> 
> Manually tested slave failover/shutdown with master using different 
> --slave_ping_timeout and --max_slave_ping_timeouts.
> Ran unit tests with shorter non-default values for ping timeouts.
> `make check` with new unit tests: ShortPingTimeoutUnreachableMaster and 
> ShortPingTimeoutUnreachableSlave
> 
> 
> Thanks,
> 
> Adam B
> 
>

Reply via email to