> On June 3, 2015, 2:24 p.m., Ben Mahler wrote: > > Overall it's looking pretty good! I'd suggest splitting out the tests given > > the comments on the source changes are fairly minor. > > > > Echoing Nik, it would be nice to have some sanity bound checks for this > > flag, to prevent operators from burning themselves.
Will do. This next revision excludes the new tests, and I will address the test comments in a subsequent patch/review. Added flag validation using the cool new lambda style from https://reviews.apache.org/r/34943 > On June 3, 2015, 2:24 p.m., Ben Mahler wrote: > > src/slave/slave.hpp, lines 100-109 > > <https://reviews.apache.org/r/29507/diff/7/?file=973886#file973886line100> > > > > Ditto Nik's comment for camelCase, but also, could you update these to > > use `Duration` rather than directly using a `double`? Removed `double ping_timeout_seconds` in favor of the nested submessage `connection.ping_timeout_seconds()` as you suggested later. > On June 3, 2015, 2:24 p.m., Ben Mahler wrote: > > src/slave/slave.cpp, lines 885-888 > > <https://reviews.apache.org/r/29507/diff/7/?file=973887#file973887line885> > > > > Could you move this up and do it once before the switch? > > > > I see now why you probably wanted the explicit default, it's too bad we > > don't have support to bind in `Option`s for optional fields in install > > handlers. How about we rely on the implicit default: > > https://developers.google.com/protocol-buffers/docs/proto#optional > > > > Alternatively, you could have a nested message that contains the fields > > (e.g. 'Connection') which allows you to explicitly check for > > '`has_ping_timeout_seconds`'). Excellent points. I moved the check up before the switch, and I went with your nested submessage approach instead of an optional default. I think that reads much better. And yes, I wanted the explicit default so I could check for a registration from a pre-0.23 master, to maintain the old behavior. I just realized that even this was not enough, in case this slave had registered with a newer master, then reregistered with an older master, so I have explicitly added an else case to reset it to the old DEFAULT_MASTER_PING_TIMEOUT(). > On June 3, 2015, 2:24 p.m., Ben Mahler wrote: > > src/messages/messages.proto, lines 277-288 > > <https://reviews.apache.org/r/29507/diff/7/?file=973883#file973883line277> > > > > The defaults here seem to suggest that we will rely on them (i.e. we > > don't check if the field is set, rather we always grab the value). However, > > 0 is going to be a problematic value! I'd suggest removing the defaults, > > and rather, adding a comment here for posterity that indicates that these > > were added in 0.23.0. Addressed with your suggestion of a nested submessage instead. > On June 3, 2015, 2:24 p.m., Ben Mahler wrote: > > src/slave/slave.cpp, line 923 > > <https://reviews.apache.org/r/29507/diff/7/?file=973887#file973887line923> > > > > "once re-registered"? Fixed. > On June 3, 2015, 2:24 p.m., Ben Mahler wrote: > > src/tests/partition_tests.cpp, lines 600-602 > > <https://reviews.apache.org/r/29507/diff/7/?file=973888#file973888line600> > > > > I'm a bit confused by this test, why are you expecting the first ping? > > Why are you dropping the pongs? > > > > I would expect the pings to be dropped entirely here so that advancing > > the clock after a slave is registered leads to a re-registration. > > > > Also, did you know about SlaveTest.PingTimeoutNoPings and > > SlaveTest.PingTimeoutSomePings tests? Is there something being tested here > > that is not captured in these? This test was intended to simulate a one-way partition where the slave can no longer reach the master, after the slave initially successfully registered. Hence the PONGs are dropped, but not the PINGs. > On June 3, 2015, 2:24 p.m., Ben Mahler wrote: > > src/tests/partition_tests.cpp, lines 647-650 > > <https://reviews.apache.org/r/29507/diff/7/?file=973888#file973888line647> > > > > This sounds like testing two particular behaviors in a single test: > > > > (1) A registered slave that never receives any Ping messages will > > attempt to reregister after the master's ping timeout. > > > > (2) Master will consider it shutdown after the same timeout. > > > > I believe we already have tests for both of these. For (1) we could > > just add the shutdown message expectation to > > `PartitionTest.PartitionedSlave`. For (2), we have > > SlaveTest.PingTimeoutNoPings, SlaveTest.PingTimeoutNoPings. Yes, I see now that SlaveTest.PingTimeoutNoPings already satisfies (1), and PartitionTest.PartitionedSlave could be modified to check for the ShutdownMessage() to satisfy (2). I originally started trying to write unit tests for registered/unregistered slaves that encountered each type of a one-way partition, with both shorter and longer-than-default ping timeouts, largely for my own understanding of the process. Testing longer timeouts proved unnecessary, and testing unregistered slaves didn't actually test my feature. I posted these two since they worked and explicitly tested the one-way partitions I was trying to test. But this one can almost certainly be covered by the tests you mentioned. I will update in a follow-up patch. - Adam ----------------------------------------------------------- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/29507/#review86461 ----------------------------------------------------------- On May 28, 2015, 4:13 p.m., Adam B wrote: > > ----------------------------------------------------------- > This is an automatically generated e-mail. To reply, visit: > https://reviews.apache.org/r/29507/ > ----------------------------------------------------------- > > (Updated May 28, 2015, 4:13 p.m.) > > > Review request for mesos, Ben Mahler and Niklas Nielsen. > > > Bugs: MESOS-2110 > https://issues.apache.org/jira/browse/MESOS-2110 > > > Repository: mesos > > > Description > ------- > > Added new --slave_ping_timeout and --max_slave_ping_timeouts flags > to mesos-master to supplement the DEFAULT_SLAVE_PING_TIMEOUT (15secs) > and DEFAULT_MAX_SLAVE_PING_TIMEOUTS (5). > > These can be extended if slaves are expected/allowed to be down for > longer than a minute or two. > > Slave will receive master's ping timeout in SlaveRe[re]gisteredMessage. > > Beware that this affects recovery from network timeouts as well as > actual slave node/process failover. > > Also fixed the log message in recoveredSlavesTimeout() to correctly > reference flags.slave_reregister_timeout instead of the unrelated > ping timeouts. > > > Diffs > ----- > > docs/configuration.md 5a41477 > docs/upgrades.md 355307a > src/master/constants.hpp 57cf8fb > src/master/constants.cpp 8c7174a > src/master/flags.hpp 84fa238 > src/master/flags.cpp 49d953a > src/master/master.cpp 1526f59 > src/messages/messages.proto 39dac72 > src/slave/constants.hpp 206d439 > src/slave/constants.cpp d8d2f98 > src/slave/slave.hpp 0207eaf > src/slave/slave.cpp b4d2029 > src/tests/partition_tests.cpp f7ee3ab > src/tests/slave_recovery_tests.cpp c036e9c > src/tests/slave_tests.cpp acae497 > > Diff: https://reviews.apache.org/r/29507/diff/ > > > Testing > ------- > > Manually tested slave failover/shutdown with master using different > --slave_ping_timeout and --max_slave_ping_timeouts. > Ran unit tests with shorter non-default values for ping timeouts. > `make check` with new unit tests: ShortPingTimeoutUnreachableMaster and > ShortPingTimeoutUnreachableSlave > > > Thanks, > > Adam B > >