> On June 6, 2019, 9:11 p.m., Benjamin Mahler wrote:
> > 3rdparty/libprocess/src/tests/process_tests.cpp
> > Lines 2196-2200 (patched)
> > <https://reviews.apache.org/r/70782/diff/2/?file=2147867#file2147867line2196>
> >
> >     Oh, this gate doesn't really accomplish much?
> >
> >     I had suggested using it in order to ensure we could fill ProcessA's
> >     queue with dispatches, and it accomplished that by preventing
> >     ProcessA::initialize() from completing.
> >
> >     Did that not work?
> >
> >     ```
> >     Promise<Nothing> gate;
> >
> >     PID<ProcessA> pid = spawn(new ProcessA(gate.future()), true);
> >
> >     for (size_t i = 0; i < 1000; ++i) {
> >       dispatch(pid, &ProcessA::f, std::unique_ptr<B>(new B()));
> >     }
> >
> >     gate.set(Nothing());
> >     ```
>
> Andrei Sekretenko wrote:
>     It is the other way round: with the gate added, filling ProcessA's
>     queue does not increase the deadlock probability.
>
>     Why it is necessary to delay the return of `ProcessA::initialize()` in
>     the current implementation (and why it was necessary to fill ProcessA's
>     queue in the previous implementation) is actually a very good question.
>
>     An attempt to add a counter and count the number of `~B()` calls that
>     happened before waiting for `reference->expired()` in
>     `ProcessManager::cleanup()` shows that `cleanup()` gets to that point
>     after 3 deleted events on average in the current implementation.
>     When I remove the gate, cleanup occurs earlier: after < 1 event on
>     average.
>
>     Basically, the reliability of this test depends on two factors:
>     1. Contention between the dispatches on one side and the
>        `state.store(TERMINATING)` + `events->consumer.decomission()` in
>        `cleanup()` on the other side. The slower the body of the inner
>        loop of the test, the higher the probability of a false negative.
>     2. Timing between dispatches and that part of cleanup. If this delay
>        changes (due to some modifications in libprocess, the kernel, the
>        hardware, etc.), this test might become unable to introduce the
>        race at all, i.e. it will become invalid.
>
>     These possibilities can be mitigated to some degree:
>     1. We can move everything possible out of the inner loop and add more
>        threads calling dispatch until the cleanup completes (or libprocess
>        deadlocks).
>     2. We can do a longer series of dispatches and call terminate() in
>        another thread.
>
>     However, I'm not sure if these ideas are sane, as they will
>     significantly complicate the test setup...
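
For readers unfamiliar with the gate idea discussed above, here is a minimal sketch of the mechanism in plain standard C++. It does not use libprocess: `GatedWorker`, `enqueue()`, `run()` and `fillThenRelease()` are hypothetical names invented for this illustration, with `std::promise<void>` standing in for `Promise<Nothing>`. It only shows how holding a consumer at a gate lets a producer fill its queue first. It does not reproduce the MESOS-9808 race or the `cleanup()` timing described above.

```cpp
#include <cassert>
#include <cstddef>
#include <future>
#include <mutex>
#include <queue>
#include <utility>

// Hypothetical stand-in for a process with an event queue.
// This is NOT libprocess; all names here are illustrative only.
class GatedWorker
{
public:
  explicit GatedWorker(std::shared_future<void> gate)
    : gate_(std::move(gate))
  {
    assert(gate_.valid());
  }

  // Plays the role of dispatch(): queue one event for the worker.
  void enqueue(int event)
  {
    std::lock_guard<std::mutex> lock(mutex_);
    events_.push(event);
  }

  // Plays the role of a blocked initialize() followed by event
  // processing: wait until the gate opens, then drain the queue.
  std::size_t run()
  {
    gate_.wait();

    std::size_t handled = 0;
    std::lock_guard<std::mutex> lock(mutex_);
    while (!events_.empty()) {
      events_.pop();
      ++handled;
    }
    return handled;
  }

private:
  std::shared_future<void> gate_;
  std::mutex mutex_;
  std::queue<int> events_;
};

// Queue `count` events while the worker is held at the gate, then
// open the gate and return how many events the worker drained.
std::size_t fillThenRelease(std::size_t count)
{
  std::promise<void> gate;
  GatedWorker worker(gate.get_future().share());

  // The worker starts on its own thread but blocks on the gate.
  std::future<std::size_t> handled =
    std::async(std::launch::async, [&worker]() { return worker.run(); });

  for (std::size_t i = 0; i < count; ++i) {
    worker.enqueue(static_cast<int>(i));
  }

  gate.set_value();      // Analogous to gate.set(Nothing()).
  return handled.get();  // Joins the worker and collects its count.
}
```

In this sketch `fillThenRelease(1000)` returns 1000, because every event is queued before the gate opens. The real test differs: whether the gate helps there depends on how the dispatches interleave with `cleanup()`.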
Ok, thanks for looking into this! It seems we don't have a non-fragile test
approach, so let's hold off for now.


- Benjamin


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/70782/#review215728
-----------------------------------------------------------


On June 7, 2019, 6:05 p.m., Andrei Sekretenko wrote:
>
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/70782/
> -----------------------------------------------------------
>
> (Updated June 7, 2019, 6:05 p.m.)
>
>
> Review request for mesos, Benjamin Mahler, Chun-Hung Hsiao, and Greg Mann.
>
>
> Bugs: MESOS-9808
>     https://issues.apache.org/jira/browse/MESOS-9808
>
>
> Repository: mesos
>
>
> Description
> -------
>
> Added a non-deterministic test for MESOS-9808.
>
>
> Diffs
> -----
>
>   3rdparty/libprocess/src/.process.cpp.swp PRE-CREATION
>   3rdparty/libprocess/src/tests/process_tests.cpp 05dc5ec2fdc74a989689e4378bef775bcf2b7a87
>
>
> Diff: https://reviews.apache.org/r/70782/diff/3/
>
>
> Testing
> -------
>
> Without either of the two fixes from https://reviews.apache.org/r/70778/,
> the test deadlocks 100 out of 100 times on the hardware I used.
> Without the first fix, the deadlock occurs for the same reason as initially
> observed in MESOS-9808.
>
>
> With both fixes applied, one run takes around 400 ms on a release build. No
> deadlock observed in 1000 runs.
>
>
> Thanks,
>
> Andrei Sekretenko
>
>
