Just an FYI on a plan I have to try to improve Solr’s test suite:

Many years back, a few changes happened that made getting a handle on
Solr’s tests quite difficult.

We started running tests in parallel.

We added a random testing framework and mentality.

We started running a number of Jenkins jobs all the time rather than one or
a few runs per day.

We added a lot of distributed code and tests with timeouts and many
complicated interactions.

The results have been great on many fronts, but the number of fails
produced and the current Jenkins reporting have made the Solr test suite
quite hard to get a handle on.

It is much too hard to answer even some basic questions: Which tests are
flakey right now? Which test fails actually affect devs most (fails that
happen even in a clean, well-resourced environment)? Did I break it? Was
that test already flakey? Is that test still flakey? What are our worst
tests right now? Is a given test getting better or worse? Is it a bad test,
or just a bad test under low resources? How many tests are flakey?

The stream of fail emails is easy to ignore and hard to follow. Even if you
do follow it, the information is an ever-changing stream that is difficult
to summarize. (Though we can do things here too; a while back I sent out a
couple of test reports built by running simple regexes over the emails and
trying to count up the test fails.)


Our nightly tests also get little to no visibility due to all the fails.
Sometimes some of those tests are simply broken because changes were made
that didn't account for them. We need a way to highlight @Nightly-only test
fails.

I, like many others, have spent a lot of time over the years trying to
improve our test stability. But it’s whack-a-mole, and it often feels hard
to make real progress on hardening the whole suite.

Part of the issue is that Solr has almost 900 tests. If even just 10-20 of
them fail 1 out of 30 or 1 out of 60 runs, that is going to produce plenty
of ‘ant test’ fails when many Jenkins machines are blasting all day.


For individual tests, one way we have found that works well for hardening
is to ‘beast’ the test: run it many times, ideally in parallel. It’s been
in my head a long time, but I have finally built on that strategy and
extended it out to a full ‘test run’ of beasting.

By beasting every test, I am able to generate a test report, based on a
single commit point, that objectively scores each test. Currently I am
doing 30 runs of each test, 10 at a time. Eventually I’d like to up that to
at least 50, and of course I can go higher still for the now-growing list
of known flakey tests. We will also have some easy-to-reference history:
tests will carry a checkable reputation.
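The per-test score can be as simple as the observed fail rate over the beasting runs. A minimal sketch, with the under-10% ‘flakey’ threshold taken from later in this mail and the category names my own:

```python
def score(fails, runs=30):
    """Score a test by its observed fail rate over `runs` beasting runs."""
    rate = fails / runs
    if fails == 0:
        label = "solid"
    elif rate < 0.10:
        # Fails in fewer than 10% of runs: merely flakey.
        label = "flakey"
    else:
        # The worst offenders; these get worked on first.
        label = "ugly"
    return rate, label
```

With 30 runs, a test that fails twice scores as flakey (~6.7%), while one failing 10 times lands in the worst bucket.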

Also, unlike many Jenkins failures, reproducing is usually quite simple.
Just beast it again. Sometimes you might have to do it more than once or
for more runs, but generally it's easy to pop up the fail again or test a
fix. So far over the years, I have never hit a fail while beasting that I
couldn't hit again, even very rare chaos monkey fails that would take
100-300 runs to hit.

Anyway, I’m working on this strategy here: SOLR-10032: Create report to
assess Solr test quality at a commit point.

I am about to put up my 3rd report, and I’m going to soon start summarizing
these reports and pinging the dev list with the results. We will see where
it goes, but I think we have few enough troublesome tests that we can make
a very significant improvement over the next couple months.


It nicely highlights the @Nightly-only test results and which tests are
Ignored, BadAppled, and AwaitsFixed; new tests that enter the report as
failing will also be highlighted and can be pushed back against. The
cleaner we get things, the stricter we can try to be.

Right now I’m mainly working on the ugliest tests. There are only a
handful. Once all of the remaining tests are just flakey (failing < 10% of
the 30 runs), I’m going to start pushing harder on those individual tests
and will try to encourage their authors to help.


- Mark
about.me/markrmiller