> On Jan. 7, 2015, 2:50 p.m., Ben Mahler wrote:
> > src/master/flags.hpp, lines 361-373
> > <https://reviews.apache.org/r/29507/diff/1/?file=804645#file804645line361>
> >
> >     Do you mind changing the names now that they're merely defaults?
> >     
> >     s/SLAVE_PING_TIMEOUT/DEFAULT_SLAVE_PING_TIMEOUT/
> >     s/MAX_SLAVE_PING_TIMEOUTS/DEFAULT_MAX_SLAVE_PING_TIMEOUTS/
> >     
> >     Once you do this, it will become apparent there is more needed in this 
> > change. In particular, all the slave logic that depends on 
> > MASTER_PING_TIMEOUT() is broken now! Making it configurable will require 
> > some more thought.

Renamed to make the DEFAULT nature explicit.

I would argue that with this change as is, the slave side of the ping timeout 
isn't much more broken than it already was. As I alluded to in the MESOS-2110 
JIRA description, the slave uses the timeout to force a leader-detection & 
re-registration, which is only loosely related to what the master is doing with 
the timeout (declaring the slave shutdown). Even without my changes, the Timers 
on master/slave could be out of sync and the two timeout events aren't 
guaranteed to happen in a particular order. If the master were to timeout 
first, and the slave did eventually detect a new leader, the slave would still 
be told to shutdown and its re-registration would fail. If the slave timed out 
first, it would try to re-register with the currently verified leader (could be 
the same), and the master's SlaveObserver would keep trying to ping the slave 
and potentially eventually shut it down. That's why I would argue that this 
patch is fine on its own, and a subsequent patch (same JIRA?) should c
 lean up the slave's ping timeout behavior.

`src/master/constants.hpp` has the following note:
```
// NOTE: The slave uses these PING constants to determine when
// the master has stopped sending pings. If these are made
// configurable, then we'll need to rely on upper/lower bounds
// to ensure that the slave is not unnecessarily triggering
// re-registrations.
```
If the master flags are configurable, there's currently no way for the slave to 
know what value the current leading master is using, especially if the slave is 
disconnected at the time. After reading through the comments in MESOS-1529 
(which added the slave-side ping timeout), I'm thinking that maybe the 
master->slave registered (and PING?) messages could embed the current master's 
actual ping timeout. Then the pingTimers started from Slave::registered() and 
Slave::ping() could use that value (same as master's) to prevent unnecessary 
reregistration due to a timeout mismatch.

Thoughts, suggestions?


- Adam


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/29507/#review67113
-----------------------------------------------------------


On Dec. 31, 2014, 1:38 a.m., Adam B wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/29507/
> -----------------------------------------------------------
> 
> (Updated Dec. 31, 2014, 1:38 a.m.)
> 
> 
> Review request for mesos and Niklas Nielsen.
> 
> 
> Bugs: MESOS-2150
>     https://issues.apache.org/jira/browse/MESOS-2150
> 
> 
> Repository: mesos-git
> 
> 
> Description
> -------
> 
> Added new --slave_ping_timeout and --max_slave_ping_timeouts flags
> to mesos-master to replace the existing (still default) 
> SLAVE_PING_TIMEOUT (15secs) and MAX_SLAVE_PING_TIMEOUTS (5).
>    
> These can be extended if slaves are expected/allowed to be down for
> longer than a minute or two.
>   
> Beware that this affects recovery from network timeouts as well as
> actual slave node/process failover.
> 
> 
> Diffs
> -----
> 
>   src/master/flags.hpp f5c8d2a8cc7530bc8746935af9ea90af747cc111 
>   src/master/master.cpp d6651e299ddb73bfdc1b126c474075db6cda8acd 
>   src/tests/fault_tolerance_tests.cpp 
> 5763486acb6d687b50c02c01ea00e1cfbea48421 
>   src/tests/mesos.cpp 3b98c69a604132be71a60fbbee4a47b51fe6956a 
>   src/tests/partition_tests.cpp fea78016268b007590516798eb30ff423fd0ae58 
>   src/tests/slave_recovery_tests.cpp cd4a398ef680b5694cb6069b8e2ca4e2c05911d1 
>   src/tests/slave_tests.cpp c50cbc799d4793243f184f9fe628b69a020adc66 
> 
> Diff: https://reviews.apache.org/r/29507/diff/
> 
> 
> Testing
> -------
> 
> Manually tested slave failover/shutdown with master using different 
> --slave_ping_timeout and --max_slave_ping_timeouts.
> Updated unit tests to use shorter non-default values for ping timeouts.
> 
> 
> Thanks,
> 
> Adam B
> 
>

Reply via email to