> On Nov. 7, 2018, 1:44 p.m., Joseph Wu wrote:
> > src/tests/slave_recovery_tests.cpp
> > Line 4827 (original), 4841-4845 (patched)
> > <https://reviews.apache.org/r/69273/diff/1/?file=2104774#file2104774line4842>
> >
> >     There is always a non-zero delay between the agent's startup and
> >     subscribing to the master:
> >
> >     https://github.com/apache/mesos/blob/master/src/slave/slave.cpp#L1306-L1321
> >
> >     There isn't a great way to wait for the agent to detect the master, and
> >     then advance the clock. Instead, try setting
> >     `slaveFlags.registration_backoff_factor = Seconds(0);`. I think that
> >     should bypass this small subscription delay.

Ooh, looks like there is a second delay in the recovery phase:
https://github.com/apache/mesos/blob/master/src/slave/slave.cpp#L7258-L7260

To get around this, I added this before destroying the agent:
```
  // This test will proceed once the executor has reconnected
  // after agent failover.
  Future<ReregisterExecutorMessage> reregisterExecutorMessage =
    FUTURE_PROTOBUF(ReregisterExecutorMessage(), _, _);
```

And then changed this block to:
```
  // Wait for the executor and then skip the timer that triggers removal
  // of executors that did not connect (none).
  AWAIT_READY(reregisterExecutorMessage);
  Clock::advance(slaveFlags.executor_reregistration_timeout);
  AWAIT_READY(slaveReregistered);
```


- Joseph


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/69273/#review210383
-----------------------------------------------------------


On Nov. 12, 2018, 6:33 a.m., Benno Evers wrote:
>
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/69273/
> -----------------------------------------------------------
>
> (Updated Nov. 12, 2018, 6:33 a.m.)
>
>
> Review request for mesos, Greg Mann and Joseph Wu.
>
>
> Bugs: MESOS-9358
>     https://issues.apache.org/jira/browse/MESOS-9358
>
>
> Repository: mesos
>
>
> Description
> -------
>
> Removed some flakiness from the test
> SlaveRecoveryTest.AgentReconfigurationWithRunningTask
> by removing the `refuse_offers` filter and by pausing
> the clock whenever possible during the test.
>
>
> Diffs
> -----
>
>   src/tests/slave_recovery_tests.cpp 5842ccffaf8c409aaa9c84720ba6c7b07ba6dc7c
>
>
> Diff: https://reviews.apache.org/r/69273/diff/2/
>
>
> Testing
> -------
>
> `./src/mesos-tests --gtest_filter="*ReconfigurationWithRunning*"
> --gtest_repeat=200`
>
>
> Thanks,
>
> Benno Evers
>
>
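
For anyone following the thread who has not worked with a paused test clock, here is a minimal standalone sketch of the idea behind "wait for the event, then advance the clock". It is not libprocess's `Clock`: `ManualClock`, `schedule`, and `advance` are hypothetical names made up for illustration, and the real implementation differs in many details. The assumption being illustrated is only that a timer fires once virtual time is advanced past its deadline, so the advance has to come after the timer has been scheduled.

```cpp
#include <chrono>
#include <functional>
#include <map>
#include <utility>

// Hypothetical stand-in for a pausable test clock (not libprocess's Clock).
// Virtual time only moves when advance() is called.
class ManualClock {
public:
  using Duration = std::chrono::milliseconds;

  // Schedule `callback` to run once `delay` of virtual time has elapsed,
  // measured from the current virtual time.
  void schedule(Duration delay, std::function<void()> callback) {
    timers_.emplace(now_ + delay, std::move(callback));
  }

  // Move virtual time forward and fire every timer that is now due,
  // in deadline order.
  void advance(Duration delta) {
    now_ += delta;
    while (!timers_.empty() && timers_.begin()->first <= now_) {
      // Remove the timer before running it so a callback may safely
      // schedule further timers.
      std::function<void()> callback = std::move(timers_.begin()->second);
      timers_.erase(timers_.begin());
      callback();
    }
  }

  Duration now() const { return now_; }

private:
  Duration now_{0};
  std::multimap<Duration, std::function<void()>> timers_;
};
```

In this model, a timer scheduled for 2000 ms stays pending after `advance(1999ms)` and fires on a further `advance(1ms)`. A timer scheduled after an advance is measured from the new virtual time, so an advance issued too early has no effect on it. That is the reason for ordering the test as "await the message, then `Clock::advance(...)`".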
