> On June 1, 2015, 1:15 p.m., Niklas Nielsen wrote:
> > LGTM - Would it make sense to have sane min/max values for the 
> > timeouts/counts?
> > 
> > I wonder it would make sense to have a test to exercise an upgrade path 
> > (the timeout being different in the slaves, than in the master). Maybe I 
> > missed that in your first test.

Added flag validation using the cool new lambda style from 
https://reviews.apache.org/r/34943

It's not so easy to test the upgrade path in a unit test, since as soon as a 
slave from this source base gets its SlaveRegistered ack from the master, it 
then has the same timeout as the master (as designed). And if the slave never 
receives its registered ack, it will retry registration after a /random/ 
backoff initially capped at 1min (--registration_backoff_factor). If the random 
backoff_factor ends up being longer than the slave's ping timeout, when the 
ping timeout expires, the slave will anyway re-detect the master and retry 
registration. As I revisit the tests (in a follow-up patch), I will brainstorm 
ways to test the upgrade path.


> On June 1, 2015, 1:15 p.m., Niklas Nielsen wrote:
> > src/messages/messages.proto, line 279
> > <https://reviews.apache.org/r/29507/diff/7/?file=973883#file973883line279>
> >
> >     Why a zero default?

Originally added as a way for the recipient of this message to know that the 
sender didn't set it, presumably because the sender is pre-0.23. Since we don't 
have a way for install handlers to translate optional fields into Options 
instead of just pulling in the default value, I couldn't think of a cleaner 
option. BenM, however, suggested pushing the ping timeout into a nested 
submessage, so I could just check has_ping_timeout_seconds() instead. Went with 
that instead.


> On June 1, 2015, 1:15 p.m., Niklas Nielsen wrote:
> > src/slave/slave.cpp, line 864
> > <https://reviews.apache.org/r/29507/diff/7/?file=973887#file973887line864>
> >
> >     If we removed the 0 default, we'd do:
> >     
> >     if (ping_timeout_seconds.isSome()) {
> >       // ...
> >     }
> >     
> >     Which seems a bit more logical to me :)

Unfortunately, we don't currently translate optional protobuf fields into 
Options in install handlers, so I went for an explicit default. BenM suggested 
using a nested message so I can check `if 
(connection.has_ping_timeout_seconds())` instead. Much more readable.


> On June 1, 2015, 1:15 p.m., Niklas Nielsen wrote:
> > docs/configuration.md, line 339
> > <https://reviews.apache.org/r/29507/diff/7/?file=973876#file973876line339>
> >
> >     As this will be markdown, how about formatting max_slave_ping_timeouts 
> > in `` or as "--max_slave_ping_timeouts" ?

Fixed.


- Adam


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/29507/#review86039
-----------------------------------------------------------


On May 28, 2015, 4:13 p.m., Adam B wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/29507/
> -----------------------------------------------------------
> 
> (Updated May 28, 2015, 4:13 p.m.)
> 
> 
> Review request for mesos, Ben Mahler and Niklas Nielsen.
> 
> 
> Bugs: MESOS-2110
>     https://issues.apache.org/jira/browse/MESOS-2110
> 
> 
> Repository: mesos
> 
> 
> Description
> -------
> 
> Added new --slave_ping_timeout and --max_slave_ping_timeouts flags
> to mesos-master to supplement the DEFAULT_SLAVE_PING_TIMEOUT (15secs)
> and DEFAULT_MAX_SLAVE_PING_TIMEOUTS (5).
>    
> These can be extended if slaves are expected/allowed to be down for
> longer than a minute or two.
> 
> Slave will receive master's ping timeout in SlaveRe[re]gisteredMessage.
>   
> Beware that this affects recovery from network timeouts as well as
> actual slave node/process failover.
> 
> Also fixed the log message in recoveredSlavesTimeout() to correctly
> reference flags.slave_reregister_timeout instead of the unrelated
> ping timeouts.
> 
> 
> Diffs
> -----
> 
>   docs/configuration.md 5a41477 
>   docs/upgrades.md 355307a 
>   src/master/constants.hpp 57cf8fb 
>   src/master/constants.cpp 8c7174a 
>   src/master/flags.hpp 84fa238 
>   src/master/flags.cpp 49d953a 
>   src/master/master.cpp 1526f59 
>   src/messages/messages.proto 39dac72 
>   src/slave/constants.hpp 206d439 
>   src/slave/constants.cpp d8d2f98 
>   src/slave/slave.hpp 0207eaf 
>   src/slave/slave.cpp b4d2029 
>   src/tests/partition_tests.cpp f7ee3ab 
>   src/tests/slave_recovery_tests.cpp c036e9c 
>   src/tests/slave_tests.cpp acae497 
> 
> Diff: https://reviews.apache.org/r/29507/diff/
> 
> 
> Testing
> -------
> 
> Manually tested slave failover/shutdown with master using different 
> --slave_ping_timeout and --max_slave_ping_timeouts.
> Ran unit tests with shorter non-default values for ping timeouts.
> `make check` with new unit tests: ShortPingTimeoutUnreachableMaster and 
> ShortPingTimeoutUnreachableSlave
> 
> 
> Thanks,
> 
> Adam B
> 
>

Reply via email to