Re: Orange is the new Bad (Gij)

Jonas Sicking Tue, 17 Nov 2015 16:36:17 -0800

Jumping in on an old thread here.

I 100% agree that getting rid of the intermittent failures is really
important. Especially the retry-three-times thing that we are doing.


One really important problem that we need to solve is in the test
harness: the socket that Marionette uses sometimes disconnects.

This means that any test can, and does, fail intermittently, and fairly
often as I understand it.

At the very least we should detect that this is the problem and rerun
the test. (This is OK since the broken socket is a Marionette bug and
not a product bug.) Even better, of course, is to find the source of
this disconnect and fix it.
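
A minimal sketch of the "detect and rerun" idea, in Python for brevity.
The names here (HarnessDisconnect, run_with_harness_retry) are
hypothetical and not part of Marionette or the Gij runner; the sketch
assumes the harness can tell a dropped socket apart from a real test
failure and signal it with a dedicated exception.

```python
class HarnessDisconnect(Exception):
    """Hypothetical signal that the harness socket dropped mid-test."""


def run_with_harness_retry(test, max_harness_retries=2):
    """Run test(); rerun it only when the harness connection dropped.

    Returns (passed, harness_retries_used). Assertion failures and any
    other exception count as real failures and are never retried.
    """
    retries = 0
    while True:
        try:
            test()
            return True, retries
        except HarnessDisconnect:
            # Harness problem, not a product bug: safe to rerun.
            if retries >= max_harness_retries:
                return False, retries
            retries += 1
        except Exception:
            # Anything else is treated as a genuine test failure.
            return False, retries
```

The point of the design is that only harness-level disconnects are
retried, so genuinely racy tests still surface as failures.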

Many have tried to find and fix this problem, but it's hard since it
only reproduces intermittently. One possible approach would be to try
to catch this in rr. I don't think that has been tried yet.

/ Jonas


On Wed, Nov 4, 2015 at 7:39 AM, Michael Henretty <[email protected]> wrote:
> Hi Gaia Folk,
>
> If you've been doing Gaia core work for any length of time, you are probably
> aware that we have *many* intermittent Gij test failures on Treeherder [1].
> But the problem is even worse than you may know! You see, each Gij test is
> run 5 times within a test chunk (e.g. Gij4) before it is marked as failing.
> Then that chunk itself is retried up to 5 times before the whole thing is
> marked as failing. This means that for a test to be marked as "passing," it
> only has to run successfully once in 25 times. I'm not kidding. Our retry
> logic, especially the retries inside the test chunk, makes it hard to know
> which intermittent tests are our worst offenders. This is bad.
>
> My suggestion is to stop doing the retries inside the chunks. That way, the
> failures will at least surface on Treeherder, which means we can star more
> tests, which means we'll have a lot more visibility on the bad intermittents.
> Sheriffs will complain a lot, so we have to be ready to act on these bugs.
> But the alternative is that we continue to write tests with a low "raciness"
> bar which, IMO, have a much lower chance of catching regressions. The longer
> we wait, the worse this problem becomes.
>
> Thoughts?
>
> Thanks,
> Michael
>
> 1.)
> https://bugzilla.mozilla.org/buglist.cgi?keywords=intermittent-failure&keywords_type=allwords&list_id=12657856&resolution=---&query_format=advanced&product=Firefox%20OS
>
> _______________________________________________
> dev-fxos mailing list
> [email protected]
> https://lists.mozilla.org/listinfo/dev-fxos
>
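
To put rough numbers on the 5 x 5 retry scheme described in the quoted
message, here is a small sketch. It assumes each run of a test passes
independently with the same probability, which is a simplification
(real intermittents are often correlated). The function names are
illustrative only.

```python
def attempts_before_failure(runs_per_chunk=5, chunk_attempts=5):
    """Total runs a test gets before it is finally marked as failing."""
    return runs_per_chunk * chunk_attempts


def prob_marked_passing(p_pass, attempts):
    """Chance that at least one of `attempts` independent runs passes."""
    return 1.0 - (1.0 - p_pass) ** attempts


if __name__ == "__main__":
    n = attempts_before_failure()
    for p in (0.05, 0.10, 0.50):
        print(f"pass rate {p:.0%}: marked passing {prob_marked_passing(p, n):.1%}")
```

Under that independence assumption, a test that passes only 10% of the
time would still be marked as passing about 93% of the time with 25
attempts.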