Just an FYI on a plan I have to try to improve Solr’s test suite: many years back, a few changes happened that made getting a handle on Solr’s tests quite difficult.
We started running tests in parallel. We adopted a randomized testing framework and mentality. We started running a few Jenkins machines all the time rather than one or a few runs per day. We added a lot of distributed code and tests with timeouts and many complicated interactions. The results have been great on many fronts, but the number of failures produced and the current Jenkins reporting have made the Solr test suite quite hard to get a handle on. It is much too hard to tell even some basic things:

- What tests are flakey right now?
- Which test failures actually affect devs most (failures that happen even in a clean, well-resourced environment)?
- Did I break it? Was that test already flakey? Is that test still flakey?
- What are our worst tests right now? Is a given test getting better or worse?
- Is it a bad test, or just a bad test under low resources?
- How many tests are flakey?

The stream of failure emails is easy to ignore and hard to follow. Even if you do follow it, the information is an always-changing stream that is difficult to summarize. (Though we can do things here too; a while back I sent a couple of test reports built by running simple regexes over the emails and counting up test failures.)

Our nightly tests also get little to no visibility due to all the failures. Sometimes some of those tests are simply broken because changes were made that didn’t account for them. We need a way to highlight @Nightly-only test failures.

I, like many others, have spent a lot of time trying to improve our test stability over the years. But it’s whack-a-mole, and it often feels hard to make real progress on hardening the whole suite. Part of the issue is that Solr has almost 900 tests. If even just 10-20 of them fail 1 out of 30 or 1 out of 60 runs, that is going to produce plenty of ‘ant test’ failures when many Jenkins machines are blasting all day.

For individual tests, one hardening approach we have found works well is to ‘beast’ the test: run it many times, ideally in parallel.
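As a rough sketch of what beasting one test looks like (hypothetical; the real Lucene/Solr build has its own tooling, and the inner command below is a simulated stand-in for an actual single-test invocation so the wiring is self-contained):

```shell
# Hypothetical beasting sketch: run a test command N times, P at a time,
# and count the failing runs. The inner sh -c command is a STAND-IN for a
# real single-test invocation; here it is simulated so the loop runs on
# its own (it "fails" whenever the run number is divisible by 3).
runs=30
parallel=10
fails=$(seq "$runs" | xargs -P "$parallel" -I{} sh -c \
  'test $(( {} % 3 )) -ne 0 || echo FAIL' | wc -l)
echo "$fails/$runs runs failed"
```

With a real test invocation substituted in, the same loop yields a per-test failure rate over a fixed number of runs, which is the kind of objective per-test score the report below is built on.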
It’s been in my head a long time, but I have finally built on that strategy and extended it to beasting a full test run. By beasting every test, I can generate a test report, based on a single commit point, that objectively scores each test. Currently I am doing 30 runs of each test, 10 at a time. Eventually I’d like to raise that to at least 50, and of course I can go higher still for the now-growing list of known flakey tests.

We will also have some easy-to-reference history: tests will carry a checkable reputation. And unlike many Jenkins failures, reproducing is usually quite simple: just beast the test again. Sometimes you might have to do it more than once or for more runs, but generally it’s easy to pop up the failure again or to test a fix. Over the years, I have never hit a failure while beasting that I couldn’t hit again, even very rare ChaosMonkey failures that took 100-300 runs to hit.

Anyway, I’m working on this strategy here: SOLR-10032: Create report to assess Solr test quality at a commit point. I am about to put up my third report, and I will soon start summarizing these reports and pinging the dev list with the results. We will see where it goes, but I think we have few enough troublesome tests that we can make a very significant improvement over the next couple of months.

The report nicely highlights the @Nightly-only test results; which tests are @Ignored, @BadApple’d, and @AwaitsFix’d; and new tests that enter the report as failing, so they can be pushed back against. The cleaner we get things, the stricter we can try to be. Right now I’m mainly working on the ugliest tests; there are only a handful. Once all of the remaining tests are merely flakey (failing fewer than 10% of the 30 runs), I’ll start pushing harder on those individual tests and will try to encourage their authors to help.

- Mark

--
- Mark
about.me/markrmiller
