Re: Review Request 29507: Added Configurable Slave Ping Timeouts

Adam B Wed, 07 Jan 2015 00:10:43 -0800


> On Jan. 6, 2015, 9:17 a.m., Niklas Nielsen wrote:
> > src/tests/mesos.cpp, line 129
> > <https://reviews.apache.org/r/29507/diff/1/?file=804648#file804648line129>
> >
> >     Can we use a constant here instead (to make sure it follows future 
> > changes to the default value?)


The point of setting these values here was for them to be _different_ from the 
default value (3x5s vs. 5x15s). I could a) create constants that aren't used 
anywhere else yet, but might be wanted by some test in the future; b) divide 
the default by some constant factor, like what registry_store_timeout does; or 
c) just use let our unit tests use the default values. I'm leaning towards (c), 
since I don't think any of our tests care about the exact value, and any future 
test that does can override the default itself.


> On Jan. 6, 2015, 9:17 a.m., Niklas Nielsen wrote:
> > src/master/master.cpp, line 1111
> > <https://reviews.apache.org/r/29507/diff/1/?file=804646#file804646line1111>
> >
> >     Can you expand a bit on this change?

Why certainly. I believe we have been logging the wrong value here, ever since 
--recovery_slave_removal_limit was introduced in 
https://reviews.apache.org/r/20097/diff/#. @bmahler correct me if I'm wrong.
Once the registrar has recovered, `Master::_recover` starts a `delay` timer for 
`flags.slave_reregister_timeout`, after which point it calls 
`Master::recoveredSlavesTimeout`, which is the method here that could exit with 
the message "After _X_ there were [an unacceptable number of] slaves recovered 
from the registry that did not reregister". This after `_recover` has already 
logged "Recovered ... slaves from the Registry; allowing 
[slave_reregister_timeout] for slaves to re-register".
The slave ping timeouts determine when the master considers a slave shutdown, 
sends SLAVE_LOST, and no longer accepts re-registrations; whereas the slave 
reregister timeout determines how soon after leader-election (and/or recovery) 
the master should expect slaves to reregister. This log message describes the 
latter scenario, so I corrected the value we were logging.
Make sense?


> On Jan. 6, 2015, 9:17 a.m., Niklas Nielsen wrote:
> > src/master/flags.hpp, line 406
> > <https://reviews.apache.org/r/29507/diff/1/?file=804645#file804645line406>
> >
> >     Why uint32 and not int?

Mostly because the original default value constant is 
`src/master/constants.cpp:35:const uint32_t MAX_SLAVE_PING_TIMEOUTS = 5;`
My guess why _that_ was uint32? Well, any count variable (like the existing 
`uint32 SlaveObserver::timeout`) should never be negative and valid, so 
unsigned makes sense. I would think that one shouldn't need more than 255 
timeouts (failed pings) before declaring a slave shutdown, so a uchar/byte 
would be sufficient, but maybe those bytes aren't worth saving, compared to 
having more bits to express a larger pseudo-infinity.

Now does the flag itself need to be uint32? Well, doing so puts the burden of 
size/sign validation on the flag-parsing code instead of wherever I end up 
doing the conversion.


- Adam


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/29507/#review66859
-----------------------------------------------------------


On Dec. 31, 2014, 1:38 a.m., Adam B wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/29507/
> -----------------------------------------------------------
> 
> (Updated Dec. 31, 2014, 1:38 a.m.)
> 
> 
> Review request for mesos and Niklas Nielsen.
> 
> 
> Bugs: MESOS-2150
>     https://issues.apache.org/jira/browse/MESOS-2150
> 
> 
> Repository: mesos-git
> 
> 
> Description
> -------
> 
> Added new --slave_ping_timeout and --max_slave_ping_timeouts flags
> to mesos-master to replace the existing (still default) 
> SLAVE_PING_TIMEOUT (15secs) and MAX_SLAVE_PING_TIMEOUTS (5).
>    
> These can be extended if slaves are expected/allowed to be down for
> longer than a minute or two.
>   
> Beware that this affects recovery from network timeouts as well as
> actual slave node/process failover.
> 
> 
> Diffs
> -----
> 
>   src/master/flags.hpp f5c8d2a8cc7530bc8746935af9ea90af747cc111 
>   src/master/master.cpp d6651e299ddb73bfdc1b126c474075db6cda8acd 
>   src/tests/fault_tolerance_tests.cpp 
> 5763486acb6d687b50c02c01ea00e1cfbea48421 
>   src/tests/mesos.cpp 3b98c69a604132be71a60fbbee4a47b51fe6956a 
>   src/tests/partition_tests.cpp fea78016268b007590516798eb30ff423fd0ae58 
>   src/tests/slave_recovery_tests.cpp cd4a398ef680b5694cb6069b8e2ca4e2c05911d1 
>   src/tests/slave_tests.cpp c50cbc799d4793243f184f9fe628b69a020adc66 
> 
> Diff: https://reviews.apache.org/r/29507/diff/
> 
> 
> Testing
> -------
> 
> Manually tested slave failover/shutdown with master using different 
> --slave_ping_timeout and --max_slave_ping_timeouts.
> Updated unit tests to use shorter non-default values for ping timeouts.
> 
> 
> Thanks,
> 
> Adam B
> 
>

Re: Review Request 29507: Added Configurable Slave Ping Timeouts

Reply via email to