> On June 3, 2015, 2:24 p.m., Ben Mahler wrote:
> > Overall it's looking pretty good! I'd suggest splitting out the tests given 
> > the comments on the source changes are fairly minor.
> > 
> > Echoing Nik, it would be nice to have some sanity bound checks for this 
> > flag, to prevent operators from burning themselves.

Will do. This next revision excludes the new tests, and I will address the test 
comments in a subsequent patch/review.

Added flag validation using the cool new lambda style from 
https://reviews.apache.org/r/34943


> On June 3, 2015, 2:24 p.m., Ben Mahler wrote:
> > src/slave/slave.hpp, lines 100-109
> > <https://reviews.apache.org/r/29507/diff/7/?file=973886#file973886line100>
> >
> >     Ditto Nik's comment for camelCase, but also, could you update these to 
> > use `Duration` rather than directly using a `double`?

Removed `double ping_timeout_seconds` in favor of the nested submessage 
`connection.ping_timeout_seconds()` as you suggested later.


> On June 3, 2015, 2:24 p.m., Ben Mahler wrote:
> > src/slave/slave.cpp, lines 885-888
> > <https://reviews.apache.org/r/29507/diff/7/?file=973887#file973887line885>
> >
> >     Could you move this up and do it once before the switch?
> >     
> >     I see now why you probably wanted the explicit default, it's too bad we 
> > don't have support to bind in `Option`s for optional fields in install 
> > handlers. How about we rely on the implicit default: 
> > https://developers.google.com/protocol-buffers/docs/proto#optional
> >     
> >     Alternatively, you could have a nested message that contains the fields 
> > (e.g. 'Connection') which allows you to explicitly check for 
> > '`has_ping_timeout_seconds`').

Excellent points. I moved the check up before the switch, and I went with your 
nested submessage approach instead of an optional default. I think that reads 
much better.
And yes, I wanted the explicit default so I could check for a registration from 
a pre-0.23 master, to maintain the old behavior. I just realized that even this 
was not enough, in case this slave had registered with a newer master, then 
reregistered with an older master, so I have explicitly added an else case to 
reset it to the old DEFAULT_MASTER_PING_TIMEOUT().


> On June 3, 2015, 2:24 p.m., Ben Mahler wrote:
> > src/messages/messages.proto, lines 277-288
> > <https://reviews.apache.org/r/29507/diff/7/?file=973883#file973883line277>
> >
> >     The defaults here seem to suggest that we will rely on them (i.e. we 
> > don't check if the field is set, rather we always grab the value). However, 
> > 0 is going to be a problematic value! I'd suggest removing the defaults, 
> > and rather, adding a comment here for posterity that indicates that these 
> > were added in 0.23.0.

Addressed with your suggestion of a nested submessage instead.


> On June 3, 2015, 2:24 p.m., Ben Mahler wrote:
> > src/slave/slave.cpp, line 923
> > <https://reviews.apache.org/r/29507/diff/7/?file=973887#file973887line923>
> >
> >     "once re-registered"?

Fixed.


> On June 3, 2015, 2:24 p.m., Ben Mahler wrote:
> > src/tests/partition_tests.cpp, lines 600-602
> > <https://reviews.apache.org/r/29507/diff/7/?file=973888#file973888line600>
> >
> >     I'm a bit confused by this test, why are you expecting the first ping? 
> > Why are you dropping the pongs?
> >     
> >     I would expect the pings to be dropped entirely here so that advancing 
> > the clock after a slave is registered leads to a re-registration.
> >     
> >     Also, did you know about SlaveTest.PingTimeoutNoPings and 
> > SlaveTest.PingTimeoutSomePings tests? Is there something being tested here 
> > that is not captured in these?

This test was intended to simulate a one-way partition where the slave can no 
longer reach the master, after the slave initially successfully registered. 
Hence the PONGs are dropped, but not the PINGs.


> On June 3, 2015, 2:24 p.m., Ben Mahler wrote:
> > src/tests/partition_tests.cpp, lines 647-650
> > <https://reviews.apache.org/r/29507/diff/7/?file=973888#file973888line647>
> >
> >     This sounds like testing two particular behaviors in a single test:
> >     
> >     (1) A registered slave that never receives any Ping messages will 
> > attempt to reregister after the master's ping timeout.
> >     
> >     (2) Master will consider it shutdown after the same timeout.
> >     
> >     I believe we already have tests for both of these. For (1) we could 
> > just add the shutdown message expectation to 
> > `PartitionTest.PartitionedSlave`. For (2), we have 
> > SlaveTest.PingTimeoutNoPings, SlaveTest.PingTimeoutNoPings.

Yes, I see now that SlaveTest.PingTimeoutNoPings already satisfies (1), and 
PartitionTest.PartitionedSlave could be modified to check for the 
ShutdownMessage() to satisfy (2).
I originally started trying to write unit tests for registered/unregistered 
slaves that encountered each type of a one-way partition, with both shorter and 
longer-than-default ping timeouts, largely for my own understanding of the 
process. Testing longer timeouts proved unnecessary, and testing unregistered 
slaves didn't actually test my feature. I posted these two since they worked 
and explicitly tested the one-way partitions I was trying to test. But this one 
can almost certainly be covered by the tests you mentioned. I will update in a 
follow-up patch.


- Adam


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/29507/#review86461
-----------------------------------------------------------


On May 28, 2015, 4:13 p.m., Adam B wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/29507/
> -----------------------------------------------------------
> 
> (Updated May 28, 2015, 4:13 p.m.)
> 
> 
> Review request for mesos, Ben Mahler and Niklas Nielsen.
> 
> 
> Bugs: MESOS-2110
>     https://issues.apache.org/jira/browse/MESOS-2110
> 
> 
> Repository: mesos
> 
> 
> Description
> -------
> 
> Added new --slave_ping_timeout and --max_slave_ping_timeouts flags
> to mesos-master to supplement the DEFAULT_SLAVE_PING_TIMEOUT (15secs)
> and DEFAULT_MAX_SLAVE_PING_TIMEOUTS (5).
>    
> These can be extended if slaves are expected/allowed to be down for
> longer than a minute or two.
> 
> Slave will receive master's ping timeout in SlaveRe[re]gisteredMessage.
>   
> Beware that this affects recovery from network timeouts as well as
> actual slave node/process failover.
> 
> Also fixed the log message in recoveredSlavesTimeout() to correctly
> reference flags.slave_reregister_timeout instead of the unrelated
> ping timeouts.
> 
> 
> Diffs
> -----
> 
>   docs/configuration.md 5a41477 
>   docs/upgrades.md 355307a 
>   src/master/constants.hpp 57cf8fb 
>   src/master/constants.cpp 8c7174a 
>   src/master/flags.hpp 84fa238 
>   src/master/flags.cpp 49d953a 
>   src/master/master.cpp 1526f59 
>   src/messages/messages.proto 39dac72 
>   src/slave/constants.hpp 206d439 
>   src/slave/constants.cpp d8d2f98 
>   src/slave/slave.hpp 0207eaf 
>   src/slave/slave.cpp b4d2029 
>   src/tests/partition_tests.cpp f7ee3ab 
>   src/tests/slave_recovery_tests.cpp c036e9c 
>   src/tests/slave_tests.cpp acae497 
> 
> Diff: https://reviews.apache.org/r/29507/diff/
> 
> 
> Testing
> -------
> 
> Manually tested slave failover/shutdown with master using different 
> --slave_ping_timeout and --max_slave_ping_timeouts.
> Ran unit tests with shorter non-default values for ping timeouts.
> `make check` with new unit tests: ShortPingTimeoutUnreachableMaster and 
> ShortPingTimeoutUnreachableSlave
> 
> 
> Thanks,
> 
> Adam B
> 
>

Reply via email to