Jumping in on an old thread here. I 100% agree that getting rid of the intermittent failures is really important, and especially getting rid of the retry-three-times thing that we are doing.
One really important problem that we need to solve is a test harness problem that sometimes causes the socket that marionette uses to disconnect. This means that any test can and does fail intermittently, and fairly often as I understand it.

At the very least we should detect that this is the problem and rerun the test. (This is OK since the broken socket is a marionette bug and not a product bug.) But even better, of course, is to find the source of this disconnect and fix it.

Many have tried to find and fix this problem, but it's hard since it only reproduces intermittently. One possible approach would be to try to catch this in rr. I don't think that has been tried yet.

/ Jonas

On Wed, Nov 4, 2015 at 7:39 AM, Michael Henretty <[email protected]> wrote:
> Hi Gaia Folk,
>
> If you've been doing Gaia core work for any length of time, you are probably
> aware that we have *many* intermittent Gij test failures on Treeherder [1].
> But the problem is even worse than you may know! You see, each Gij test is
> run 5 times within a test chunk (e.g. Gij4) before it is marked as failing.
> Then that chunk itself is retried up to 5 times before the whole thing is
> marked as failing. This means that for a test to be marked as "passing," it
> only has to run successfully once in 25 times. I'm not kidding. Our retry
> logic, especially the retries inside the test chunk, makes it hard to know
> which intermittent tests are our worst offenders. This is bad.
>
> My suggestion is to stop doing the retries inside the chunks. That way, the
> failures will at least surface on Treeherder, which means we can star more
> tests, which means we'll have a lot more visibility on the bad intermittents.
> Sheriffs will complain a lot, so we have to be ready to act on these bugs.
> But the alternative is that we continue to write tests with a low "raciness"
> bar which, IMO, have a much lower chance of catching regressions. The longer
> we wait, the worse this problem becomes.
>
> Thoughts?
>
> Thanks,
> Michael
>
> 1.)
> https://bugzilla.mozilla.org/buglist.cgi?keywords=intermittent-failure&keywords_type=allwords&list_id=12657856&resolution=---&query_format=advanced&product=Firefox%20OS
>
> _______________________________________________
> dev-fxos mailing list
> [email protected]
> https://lists.mozilla.org/listinfo/dev-fxos
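
To make the two points in this thread concrete, here is a minimal Python sketch. It is not the actual Gij or marionette harness code: `HarnessDisconnect`, `pass_probability` and `run_with_harness_retry` are hypothetical names invented for illustration, and the sketch assumes the harness can tell a dropped socket apart from a real test failure. The first function works through the 5 x 5 = 25 retry arithmetic, and the second reruns a test only when the harness itself broke.

```python
class HarnessDisconnect(Exception):
    """Hypothetical marker for a dropped harness socket (not a product bug)."""


def pass_probability(p_single, in_chunk_runs=5, chunk_retries=5):
    """Chance a test is reported as passing when one success in any of the
    nested runs is enough (5 runs per chunk x 5 chunk retries = 25 runs)."""
    total_runs = in_chunk_runs * chunk_retries
    return 1.0 - (1.0 - p_single) ** total_runs


def run_with_harness_retry(test, max_harness_retries=3):
    """Rerun `test` only when the harness socket dropped.

    Any other exception (for example an AssertionError from the test body)
    propagates immediately, so real failures still surface.
    """
    for attempt in range(max_harness_retries + 1):
        try:
            return test()
        except HarnessDisconnect:
            # Assumption: a disconnect says nothing about the product,
            # so retrying is safe; give up once the retry budget is spent.
            if attempt == max_harness_retries:
                raise
```

Under these assumptions, a test that passes only 10% of the time on a single run would still be reported as passing roughly 93% of the time with 25 nested runs, which is why the in-chunk retries hide the worst offenders. Limiting retries to harness disconnects keeps genuine failures visible.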

