Thanks so much Mark for doing this -- it is so important! The current state is a sad one.
Another thing related to Solr tests I've been thinking about is trying to make tests more self-contained, instead of needing custom schemas or configs, or running the risk of adding yet more info to a shared config. In many (but not all) cases, I believe it's possible to write tests using ManagedSchema (and some techniques I'll share later) to avoid that. Once I finish a feature I'm working on that uses this technique, I'll share it more broadly to get public comment.

~ David

On Wed, Feb 8, 2017 at 7:02 PM Mark Miller <[email protected]> wrote:

> Just an FYI on a plan I have to try and improve Solr's test suite:
>
> Many years back, a few changes happened that made getting a handle on
> Solr's tests quite difficult.
>
> We started running tests in parallel.
>
> We added a random testing framework and mentality.
>
> We started running a few Jenkins instances all the time rather than one
> or a few runs per day.
>
> We added a lot of distributed code and tests with timeouts and many
> complicated interactions.
>
> The results have been great on many fronts, but the number of fails
> produced and the current Jenkins reporting have made the Solr test suite
> quite hard to get a handle on.
>
> It is much too hard to tell even some basic things: What tests are
> flakey right now? Which test fails actually affect devs most (fails that
> happen even in a clean, well-resourced environment)? Did I break it? Was
> that test already flakey? Is that test still flakey? What are our worst
> tests right now? Is that test getting better or worse? Is it a bad test,
> or just a bad test under low resources? How many tests are flakey?
>
> The stream of email fails is easy to ignore and hard to follow. Even if
> you do follow it, the information is an always-changing stream that is
> difficult to summarize. (Though we can do things here too; a while back
> I sent a couple of test reports by doing simple regexes on emails and
> trying to count up test fails.)
>
> Our nightly tests also get little to no visibility due to all the fails.
> Sometimes some of those tests are simply broken because changes were
> made that didn't account for them. We need a way to highlight
> @Nightly-only test fails.
>
> I, like many others, have spent a lot of time trying to improve our test
> stability over the years. But it's whack-a-mole, and it often feels like
> it's hard to make real progress on hardening the whole suite.
>
> Part of the issue is that Solr has almost 900 tests. If even just 10-20
> of them fail 1 out of 30 or 1 out of 60 runs, that is going to produce
> plenty of 'ant test' fails when many Jenkins machines are blasting all
> day.
>
> For individual tests, one way we have found that works well for
> hardening is to 'beast' the test: run it many times, ideally in
> parallel. It's been in my head a long time, but I have finally built on
> that strategy and extended it out to a full 'test run' of beasting.
>
> By beasting every test, I am able to generate a test report based on a
> single commit point that objectively scores each test. Currently I am
> doing 30 test runs, 10 at a time. Eventually I'd like to up that to at
> least 50, but of course I can also just go higher for the now-growing
> list of known flakey tests. We will also have some easy-to-reference
> history, though. Tests will carry a checkable reputation.
>
> Also, unlike many Jenkins failures, reproducing is usually quite simple:
> just beast it again. Sometimes you might have to do it more than once or
> for more runs, but generally it's easy to pop up the fail again or test
> a fix. So far, over the years, I have never hit a fail while beasting
> that I couldn't hit again, even very rare chaos monkey fails that would
> take 100-300 runs to hit.
>
> Anyway, I'm working on this strategy here: SOLR-10032: Create report to
> assess Solr test quality at a commit point.
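[Editor's note: to make the arithmetic behind the "10-20 tests failing 1 out of 30 or 1 out of 60 runs" point concrete, here is a quick back-of-envelope sketch. It assumes failures are independent, which is a simplification; the specific counts (15 tests at 1/30, 20 at 1/60) are illustrative picks from the ranges in the email, not measured data.]

```python
# Back-of-envelope: chance that a full 'ant test' run comes back clean when
# a handful of tests are individually flakey. Assumes independent failures.
def clean_run_probability(num_flakey: int, fail_rate: float) -> float:
    """Probability that none of the flakey tests fail in one full run."""
    return (1.0 - fail_rate) ** num_flakey

# 15 tests each failing 1 run in 30:
p = clean_run_probability(15, 1 / 30)
print(f"clean-run probability: {p:.2f}")   # ~0.60, i.e. ~40% of full runs fail
# 20 tests each failing 1 run in 60:
p2 = clean_run_probability(20, 1 / 60)
print(f"clean-run probability: {p2:.2f}")  # ~0.71, i.e. ~29% of full runs fail
```

Even at these modest per-test rates, roughly a third of all full-suite runs fail somewhere, which is exactly the "plenty of 'ant test' fails" Mark describes once many Jenkins machines run all day.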
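[Editor's note: the "beast it again" workflow is easy to script. Below is a minimal, hypothetical sketch: a POSIX shell loop that runs one command many times in parallel batches and counts failures. The `ant test -Dtestcase=...` usage shown in the comment is the usual Lucene/Solr invocation, but the loop itself is generic; it is not Mark's actual tooling.]

```shell
# Hypothetical beasting loop: run a command ITERS times, PAR at a time,
# and report how many runs failed. A sketch of the "run it many times,
# ideally in parallel" idea, not the real SOLR-10032 scripts.
beast() {
  iters=$1; par=$2; shift 2
  fails=0
  i=0
  while [ "$i" -lt "$iters" ]; do
    pids=""
    batch=0
    # Launch up to $par copies of the command in the background.
    while [ "$batch" -lt "$par" ] && [ "$i" -lt "$iters" ]; do
      "$@" >/dev/null 2>&1 &
      pids="$pids $!"
      batch=$((batch + 1))
      i=$((i + 1))
    done
    # Reap the batch, counting non-zero exits as fails.
    for pid in $pids; do
      wait "$pid" || fails=$((fails + 1))
    done
  done
  echo "$fails"
}

# Example (Lucene/Solr style): 30 runs of one test, 10 at a time.
# beast 30 10 ant test -Dtestcase=SomeFlakeyTest
```

A result of 0 means the test survived the beasting; anything else gives a rough fail rate (fails out of iters) that can be compared across runs or after a fix.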
>
> I am about to put up my 3rd report, and I'm soon going to start
> summarizing these reports and pinging the dev list with the results. We
> will see where it goes, but I think we have few enough troublesome tests
> that we can make a very significant improvement over the next couple of
> months.
>
> It nicely highlights the @Nightly-only test results and which tests are
> Ignored, BadApple'd, and AwaitsFix'd; new tests that enter the report as
> failing will also be highlighted and can be pushed back against. The
> cleaner we get things, the more strict we can try to be.
>
> Right now I'm mainly working on the ugliest tests. There are only a
> handful. Once all of the remaining tests are just flakey (fail < 10% of
> the 30 runs), I'm going to start pushing on those individual tests
> harder and will try to encourage the authors of those tests to help.
>
> - Mark
> --
> - Mark
> about.me/markrmiller

--
Lucene/Solr Search Committer, Consultant, Developer, Author, Speaker
LinkedIn: http://linkedin.com/in/davidwsmiley | Book: http://www.solrenterprisesearchserver.com
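[Editor's note: as a rough illustration of the scoring idea in Mark's report, the sketch below buckets tests by their fail rate over a beasting run. The "flakey" cutoff (under 10% of 30 runs) matches the threshold mentioned in the email; the bucket names and data are illustrative, not the actual SOLR-10032 report code.]

```python
# Sketch: classify tests by fail rate over a 30-run beasting pass.
# "ok" = never failed, "flakey" = failed under 10% of runs (the email's
# threshold), "ugly" = failed 10% of runs or more. Illustrative only.
def classify(fails: int, runs: int = 30) -> str:
    rate = fails / runs
    if rate == 0.0:
        return "ok"
    if rate < 0.10:
        return "flakey"
    return "ugly"

# Hypothetical per-test fail counts out of 30 runs:
results = {"TestA": 0, "TestB": 2, "TestC": 11}
for name, fails in results.items():
    print(f"{name}: {classify(fails)}")
```

Scoring every test against one commit point this way is what lets a report carry the "checkable reputation" Mark describes: the same buckets can be recomputed on a later commit and compared.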
