> On June 6, 2019, 9:11 p.m., Benjamin Mahler wrote:
> > 3rdparty/libprocess/src/tests/process_tests.cpp
> > Lines 2196-2200 (patched)
> > <https://reviews.apache.org/r/70782/diff/2/?file=2147867#file2147867line2196>
> >
> >     Oh, this gate doesn't really accomplish much?
> >     
> >     I had suggested using it in order to ensure we could fill ProcessA's 
> > queue with dispatches, and it accomplished that by preventing 
> > ProcessA::initialize() from completing.
> >     
> >     Did that not work?
> >     
> >     ```
> >         Promise<Nothing> gate;
> >     
> >         PID<ProcessA> pid = spawn(new ProcessA(gate.future()), true);
> >     
> >         for (size_t i = 0; i < 1000; ++i) {
> >           dispatch(pid, &ProcessA::f, std::unique_ptr<B>(new B()));
> >         }
> >     
> >         gate.set(Nothing());
> >     ```
> 
> Andrei Sekretenko wrote:
>     It is the other way round: with the gate added, filling ProcessA's queue 
> does not increase the deadlock probability.
>     
>     Why it is necessary to delay the return of `ProcessA::initialize()` in 
> the current implementation (and why it was necessary to fill ProcessA's 
> queue in the previous implementation) is actually a very good question.
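>     
>     For reference, a minimal sketch of what such a gated process might look 
> like (hypothetical shape, not the actual test code; the point is that 
> `initialize()` blocks the worker thread until the test sets the gate, so 
> dispatches to `f` queue up behind it):
>     
>     ```
>     #include <memory>
>     
>     #include <process/future.hpp>
>     #include <process/process.hpp>
>     
>     #include <stout/nothing.hpp>
>     
>     struct B {};  // Hypothetical payload type; the test observes ~B().
>     
>     class ProcessA : public process::Process<ProcessA>
>     {
>     public:
>       explicit ProcessA(const process::Future<Nothing>& gate)
>         : gate(gate) {}
>     
>       void f(std::unique_ptr<B> b) {}
>     
>     protected:
>       void initialize() override
>       {
>         // Block here until the test calls `gate.set(Nothing())`; every
>         // dispatch issued meanwhile stays queued behind initialize().
>         gate.await();
>       }
>     
>     private:
>       process::Future<Nothing> gate;
>     };
>     ```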
>     
>     
>     Adding a counter for the number of `~B()` calls that happen before 
> `ProcessManager::cleanup()` starts waiting for `reference->expired()` shows 
> that, in the current implementation, `cleanup()` gets to that point after 3 
> deleted events on average.
>     When I remove the gate, cleanup occurs earlier: after less than 1 event 
> on average.
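>     
>     (A sketch of what such a counter might look like; the counter name and 
> placement are assumed, and the value was read just before `cleanup()` 
> starts waiting on `reference->expired()`:)
>     
>     ```
>     #include <atomic>
>     
>     static std::atomic<size_t> destroyed(0);
>     
>     struct B
>     {
>       // Each deleted dispatch event destroys its B payload, so this
>       // counts the events deleted so far.
>       ~B() { destroyed.fetch_add(1, std::memory_order_relaxed); }
>     };
>     ```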
>     
>     Basically, the reliability of this test depends on two factors:
>     1. Contention between the dispatches on one side and the 
> `state.store(TERMINATING)` + `events->consumer.decomission()` in `cleanup()` 
> on the other side. The slower the body of the inner loop of the test, the 
> higher the probability of a false negative.
>     2. Timing between the dispatches and that part of cleanup. If this delay 
> changes (due to some modification in libprocess, the kernel, the hardware, 
> etc.), this test might become unable to trigger the race at all, i.e. it 
> will become invalid.
>     
>     These risks can be mitigated to some degree:
>     1. We can move everything possible out of the inner loop and add more 
> threads calling dispatch until the cleanup completes (or libprocess 
> deadlocks); see the sketch after this list.
>     2. We can do a longer series of dispatches and call terminate() in 
> another thread.
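>     
>     A rough sketch of mitigation (1), assuming the `pid`, `ProcessA`, and 
> `B` from above (the thread count is arbitrary; this is not a worked-out 
> implementation):
>     
>     ```
>     #include <atomic>
>     #include <thread>
>     #include <vector>
>     
>     std::atomic<bool> done(false);
>     
>     // Several threads keep dispatching while the process is being
>     // terminated, widening the window for the race with cleanup().
>     std::vector<std::thread> dispatchers;
>     for (size_t i = 0; i < 4; ++i) {
>       dispatchers.emplace_back([&]() {
>         while (!done.load()) {
>           dispatch(pid, &ProcessA::f, std::unique_ptr<B>(new B()));
>         }
>       });
>     }
>     
>     process::terminate(pid);
>     process::wait(pid);  // Completes only if cleanup() does not deadlock.
>     
>     done.store(true);
>     for (std::thread& thread : dispatchers) {
>       thread.join();
>     }
>     ```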
>     
>     However, I'm not sure if these ideas are sane, as they will significantly 
> complicate the test setup...

Ok, thanks for looking into this! It seems we don't have a non-fragile test 
approach, so let's hold off for now.


- Benjamin


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/70782/#review215728
-----------------------------------------------------------


On June 7, 2019, 6:05 p.m., Andrei Sekretenko wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/70782/
> -----------------------------------------------------------
> 
> (Updated June 7, 2019, 6:05 p.m.)
> 
> 
> Review request for mesos, Benjamin Mahler, Chun-Hung Hsiao, and Greg Mann.
> 
> 
> Bugs: MESOS-9808
>     https://issues.apache.org/jira/browse/MESOS-9808
> 
> 
> Repository: mesos
> 
> 
> Description
> -------
> 
> Added a non-deterministic test for MESOS-9808.
> 
> 
> Diffs
> -----
> 
>   3rdparty/libprocess/src/.process.cpp.swp PRE-CREATION 
>   3rdparty/libprocess/src/tests/process_tests.cpp 
> 05dc5ec2fdc74a989689e4378bef775bcf2b7a87 
> 
> 
> Diff: https://reviews.apache.org/r/70782/diff/3/
> 
> 
> Testing
> -------
> 
> Without either of the two fixes from https://reviews.apache.org/r/70778/, 
> this deadlocks 100 out of 100 times on the hardware I used.
> Without the first fix, the deadlock occurs for the same reason as initially 
> observed in MESOS-9808.
> 
> 
> With both fixes applied, one run takes around 400 ms on a release build. No 
> deadlock observed in 1000 runs.
> 
> 
> Thanks,
> 
> Andrei Sekretenko
> 
>
