Testing distributed systems requires, well, distributed systems, which
is what starting clusters is all about. The great leap of faith of
individual-method unit testing is that if all the small parts are
tested, combining them in various ways will "just work". This is
emphatically not true of distributed systems.

Which is also one of the reasons some of the tests are long. It takes
time (as you pointed out) to set up a cluster, so once a cluster is
started, testing a bunch of things amortizes the expense of setting it
up. If each test of some bit of distributed functionality set up and
tore down its own cluster, that would extend the time it takes to run
a full test suite by quite a bit. Note this is mostly a problem in
Solr; Lucene tests tend to run much faster.

Seconding what Dawid said about randomness. All the randomization
functions are controlled by the "seed"; that's what the "reproduce
with" line in the results is all about. That "controlled
randomization" has uncovered any number of bugs in obscure situations
that would have been vastly more painful to discover otherwise. One
example I remember went along the lines of "this particular
functionality is broken when operating system X thinks it's in the
Turkish locale". Which is _also_ why all tests must use the
framework's random() method provided by LuceneTestCase and never the
plain Java random functions.
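To illustrate the principle (a toy sketch, not Lucene's actual
implementation; the real machinery is the randomizedtesting library
behind LuceneTestCase): when every bit of randomness in a run is
derived from one master seed, printing that seed is enough to replay
the exact same "random" choices later.

```java
import java.util.Random;

// Hypothetical stand-in for seed-controlled randomization: all
// randomness in a run flows from one master seed, so any failure can
// be replayed exactly by re-running with the same seed.
public class SeededRandomSketch {
    public static void main(String[] args) {
        long seed = args.length > 0 ? Long.parseLong(args[0])
                                    : System.nanoTime(); // fresh seed per run
        System.out.println("reproduce with: seed=" + seed);

        Random random = new Random(seed); // every "random" choice uses this
        // Randomized test input -- but fully determined by the seed above.
        int docCount = 1 + random.nextInt(100);
        boolean useCompoundFile = random.nextBoolean();
        System.out.println("docCount=" + docCount
                + " useCompoundFile=" + useCompoundFile);

        // Re-seeding with the same value replays the identical sequence.
        Random replay = new Random(seed);
        assert 1 + replay.nextInt(100) == docCount;
        assert replay.nextBoolean() == useCompoundFile;
    }
}
```

Running it twice with the same seed argument produces identical
"random" values; a new seed exercises a different path through the
same test code.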

For that matter, one _other_ class of problem uncovered by the
randomness is hidden coupling between tests: with different seeds, the
tests in a suite are executed in a different order, so side effects of
one test method that would affect another are flushed out.
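A minimal, self-contained sketch of that effect (the test names and
shuffling here are hypothetical; real Lucene tests get their method
order from the framework's master seed): a shared static field couples
two test methods, so whether the suite passes depends entirely on the
order a given seed produces.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Random;

// Why shuffling test-method order per seed flushes out hidden
// coupling: testB only passes if testA ran first and left shared
// state behind. All names here are illustrative.
public class OrderDependenceSketch {
    static Integer cachedValue; // shared mutable state -- the bug

    static void testA() { cachedValue = 42; }

    static boolean testB() {
        // Silently depends on testA's side effect.
        return cachedValue != null && cachedValue == 42;
    }

    public static void main(String[] args) {
        long seed = args.length > 0 ? Long.parseLong(args[0])
                                    : System.nanoTime();
        List<String> order = Arrays.asList("testA", "testB");
        Collections.shuffle(order, new Random(seed)); // order varies by seed
        cachedValue = null;
        boolean passed = true;
        for (String name : order) {
            if (name.equals("testA")) testA();
            else passed = testB(); // fails whenever testB runs first
        }
        System.out.println("order=" + order + " passed=" + passed);
    }
}
```

Some seeds shuffle testB ahead of testA and the suite fails; a fixed
run order would never reveal the dependence.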

Mind you, this doesn't help with race conditions that are sensitive
to, say, the clock speed of the machine you're running on....

All that said, there's plenty of room for improving our tests. I'm
sure there are tests that spin up a cluster that don't need to. All
patches welcome, of course.

Best,
Erick



On Fri, Feb 23, 2018 at 8:20 AM, Dawid Weiss <dawid.we...@gmail.com> wrote:
>> Randomness makes it difficult to correlate a failure to the commit that made
>> the test to fail (as was pointed out earlier in the discussion). If each
>> execution path is different, it may very well be that a failure you
>> experience is introduced several commits ago, so it may not be your fault.
>
> This is true only to a certain degree. If you don't randomize, all you
> do is essentially run a fixed scenario. This protects you against a
> regression in this particular state, but it doesn't help in
> discovering new corner cases or environment quirks, which it would be
> prohibitive to cover as a full Cartesian product of all possibilities.
> So there is a tradeoff here, and most folks in this project have agreed
> to it. If you look at how many problems randomization has helped
> discover, I think it's a good tradeoff.
>
> Finally: your scenario can actually be reproduced with ease. Run the
> tests with a fixed seed before you apply a patch and after you apply
> it... if there is no regression, you can assume your patch is fine (but
> it doesn't mean it won't fail later on a different seed, which
> nobody will blame you for).
>
> Dawid
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
> For additional commands, e-mail: dev-h...@lucene.apache.org
>
