> On Jan. 6, 2015, 9:17 a.m., Niklas Nielsen wrote: > > src/tests/mesos.cpp, line 129 > > <https://reviews.apache.org/r/29507/diff/1/?file=804648#file804648line129> > > > > Can we use a constant here instead (to make sure it follows future > > changes to the default value?)
The point of setting these values here was for them to be _different_ from the default value (3x5s vs. 5x15s). I could a) create constants that aren't used anywhere else yet, but might be wanted by some test in the future; b) divide the default by some constant factor, like what registry_store_timeout does; or c) just use let our unit tests use the default values. I'm leaning towards (c), since I don't think any of our tests care about the exact value, and any future test that does can override the default itself. > On Jan. 6, 2015, 9:17 a.m., Niklas Nielsen wrote: > > src/master/master.cpp, line 1111 > > <https://reviews.apache.org/r/29507/diff/1/?file=804646#file804646line1111> > > > > Can you expand a bit on this change? Why certainly. I believe we have been logging the wrong value here, ever since --recovery_slave_removal_limit was introduced in https://reviews.apache.org/r/20097/diff/#. @bmahler correct me if I'm wrong. Once the registrar has recovered, `Master::_recover` starts a `delay` timer for `flags.slave_reregister_timeout`, after which point it calls `Master::recoveredSlavesTimeout`, which is the method here that could exit with the message "After _X_ there were [an unacceptable number of] slaves recovered from the registry that did not reregister". This after `_recover` has already logged "Recovered ... slaves from the Registry; allowing [slave_reregister_timeout] for slaves to re-register". The slave ping timeouts determine when the master considers a slave shutdown, sends SLAVE_LOST, and no longer accepts re-registrations; whereas the slave reregister timeout determines how soon after leader-election (and/or recovery) the master should expect slaves to reregister. This log message describes the latter scenario, so I corrected the value we were logging. Make sense? > On Jan. 6, 2015, 9:17 a.m., Niklas Nielsen wrote: > > src/master/flags.hpp, line 406 > > <https://reviews.apache.org/r/29507/diff/1/?file=804645#file804645line406> > > > > Why uint32 and not int? Mostly because the original default value constant is `src/master/constants.cpp:35:const uint32_t MAX_SLAVE_PING_TIMEOUTS = 5;` My guess why _that_ was uint32? Well, any count variable (like the existing `uint32 SlaveObserver::timeout`) should never be negative and valid, so unsigned makes sense. I would think that one shouldn't need more than 255 timeouts (failed pings) before declaring a slave shutdown, so a uchar/byte would be sufficient, but maybe those bytes aren't worth saving, compared to having more bits to express a larger pseudo-infinity. Now does the flag itself need to be uint32? Well, doing so puts the burden of size/sign validation on the flag-parsing code instead of wherever I end up doing the conversion. - Adam ----------------------------------------------------------- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/29507/#review66859 ----------------------------------------------------------- On Dec. 31, 2014, 1:38 a.m., Adam B wrote: > > ----------------------------------------------------------- > This is an automatically generated e-mail. To reply, visit: > https://reviews.apache.org/r/29507/ > ----------------------------------------------------------- > > (Updated Dec. 31, 2014, 1:38 a.m.) > > > Review request for mesos and Niklas Nielsen. > > > Bugs: MESOS-2150 > https://issues.apache.org/jira/browse/MESOS-2150 > > > Repository: mesos-git > > > Description > ------- > > Added new --slave_ping_timeout and --max_slave_ping_timeouts flags > to mesos-master to replace the existing (still default) > SLAVE_PING_TIMEOUT (15secs) and MAX_SLAVE_PING_TIMEOUTS (5). > > These can be extended if slaves are expected/allowed to be down for > longer than a minute or two. > > Beware that this affects recovery from network timeouts as well as > actual slave node/process failover. > > > Diffs > ----- > > src/master/flags.hpp f5c8d2a8cc7530bc8746935af9ea90af747cc111 > src/master/master.cpp d6651e299ddb73bfdc1b126c474075db6cda8acd > src/tests/fault_tolerance_tests.cpp > 5763486acb6d687b50c02c01ea00e1cfbea48421 > src/tests/mesos.cpp 3b98c69a604132be71a60fbbee4a47b51fe6956a > src/tests/partition_tests.cpp fea78016268b007590516798eb30ff423fd0ae58 > src/tests/slave_recovery_tests.cpp cd4a398ef680b5694cb6069b8e2ca4e2c05911d1 > src/tests/slave_tests.cpp c50cbc799d4793243f184f9fe628b69a020adc66 > > Diff: https://reviews.apache.org/r/29507/diff/ > > > Testing > ------- > > Manually tested slave failover/shutdown with master using different > --slave_ping_timeout and --max_slave_ping_timeouts. > Updated unit tests to use shorter non-default values for ping timeouts. > > > Thanks, > > Adam B > >