> On June 24, 2015, 6:41 p.m., Ben Mahler wrote:
> > Actually, we should think about one more thing, how does this interact with 
> > the zookeeper session timeout?
> 
> Adam B wrote:
>     The hardcoded individual ping timeout (15secs) was previously longer than 
> the default zk session timeout (10secs), but the zk session timeout is 
> already configurable with no bounds/validation, so users could already change 
> it to be longer than an individual ping timeout, or even the total ping 
> timeout. Is this a bad idea? What would this mean?
>     This is mostly relevant to the slave-side (as a leader detector), in that 
> a slave may timeout and decide to check for a new leading master to 
> reregister with, but the zk session may have recently gone bad but not yet 
> timed out. Since a zk session could go bad/timeout at any point along the 
> way, I don't think making the ping timeout configurable will introduce any 
> new potential errors. Rather, it would just be advisable to keep the zk 
> session timeouts reasonably small so that the zk session is likely to be 
> healthy whenever a slave needs to detect a new leader. We could introduce 
> validation that the zk_session_timeout is shorter than the (total) ping 
> timeout, but I'm not sure that's even necessary.
>     Does this make sense? I'm no ZK expert, so I defer to those with more 
> experience.
> 
> Ben Mahler wrote:
>     Right, the ZK timeout is negotiated with ZooKeeper, so it gets bounded 
> based on the ZK server configuration:
>     http://zookeeper.apache.org/doc/r3.3.1/zookeeperAdmin.html
>     
>     Validating that the session timeout is less than the total ping timeout 
> might be nice, otherwise every session expiration slave-side will trigger 
> useless re-registration attempts. Mind just adding a NOTE that reflects this 
> in the flag help? The validation might be nice, but the ZK session timeout is 
> only really determined after negotation anyway..

Noted in the flag help.


- Adam


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/29507/#review89305
-----------------------------------------------------------


On June 26, 2015, 3:12 a.m., Adam B wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/29507/
> -----------------------------------------------------------
> 
> (Updated June 26, 2015, 3:12 a.m.)
> 
> 
> Review request for mesos, Ben Mahler and Niklas Nielsen.
> 
> 
> Bugs: MESOS-2110
>     https://issues.apache.org/jira/browse/MESOS-2110
> 
> 
> Repository: mesos
> 
> 
> Description
> -------
> 
> Added new --slave_ping_timeout and --max_slave_ping_timeouts flags
> to mesos-master to supplement the DEFAULT_SLAVE_PING_TIMEOUT (15secs)
> and DEFAULT_MAX_SLAVE_PING_TIMEOUTS (5).
>    
> These can be extended if slaves are expected/allowed to be down for
> longer than a minute or two.
> 
> Slave will receive master's ping timeout in SlaveRe[re]gisteredMessage.
>   
> Beware that this affects recovery from network timeouts as well as
> actual slave node/process failover.
> 
> Also fixed the log message in recoveredSlavesTimeout() to correctly
> reference flags.slave_reregister_timeout instead of the unrelated
> ping timeouts.
> 
> 
> Diffs
> -----
> 
>   docs/configuration.md aaf65bf 
>   docs/upgrades.md 73e503c 
>   src/master/constants.hpp 072d59c 
>   src/master/constants.cpp 997b792 
>   src/master/flags.hpp 55ed3a9 
>   src/master/flags.cpp 4377715 
>   src/master/master.cpp 0782b54 
>   src/messages/messages.proto 1c8d79e 
>   src/slave/constants.hpp 84927e5 
>   src/slave/constants.cpp d8d2f98 
>   src/slave/slave.hpp f1cf3b8 
>   src/slave/slave.cpp b3e1ccc 
>   src/tests/partition_tests.cpp f7ee3ab 
>   src/tests/slave_recovery_tests.cpp c036e9c 
>   src/tests/slave_tests.cpp e9002e8 
> 
> Diff: https://reviews.apache.org/r/29507/diff/
> 
> 
> Testing
> -------
> 
> Manually tested slave failover/shutdown with master using different 
> --slave_ping_timeout and --max_slave_ping_timeouts.
> Ran unit tests with shorter non-default values for ping timeouts.
> `make check` with new unit tests: ShortPingTimeoutUnreachableMaster and 
> ShortPingTimeoutUnreachableSlave
> 
> 
> Thanks,
> 
> Adam B
> 
>

Reply via email to