On 4/8/14, 6:51 AM, James Graham wrote:
On 08/04/14 14:43, Andrew Halberstadt wrote:
On 07/04/14 11:49 AM, Aryeh Gregor wrote:
On Mon, Apr 7, 2014 at 6:12 PM, Ted Mielczarek <t...@mielczarek.org>
wrote:
If a bug is causing a test to fail intermittently, then that test loses
value. It still has some value in that it can catch regressions that
cause it to fail permanently, but we would not be able to catch a
regression that causes it to fail intermittently.

To some degree, yes, marking a test as expected intermittent causes it
to lose value.  If the developers who work on the relevant component
think the lost value is important enough to track down the cause of
the intermittent failure, they can do so.  That should be their
decision, not something forced on them by infrastructure issues
("everyone else will suffer if you don't find the cause for this
failure in your test").  Making known intermittent failures not turn
the tree orange doesn't stop anyone from fixing intermittent failures,
it just removes pressure from them if they decide they don't want to.
If most developers think they have more important bugs to fix, then I
don't see a problem with that.

I think this proposal would make more sense if the state of our
infrastructure and tooling was able to handle it properly. Right now,
automatically marking known intermittents would cause the test to lose
*all* value. It's sad, but the only data we have about intermittents
comes from the sheriffs manually starring them. There is also currently
no way to mark a test KNOWN-RANDOM and automatically detect if it starts
failing permanently. This means the failures can't be starred and become
nearly impossible to discover, let alone diagnose.

So, what's the minimum level of infrastructure that you think would be
needed to go ahead with this plan? To me it seems like the current
system already isn't working very well, so the bar for moving forward
with a plan that would increase the amount of data we had available to
diagnose problems with intermittents, and reduce the amount of manual
labour needed in marking them, should be quite low.

The simple solution is to have a separate in-tree manifest annotation for intermittents. Put another way, we can describe exactly why we are not running a test. This is kinda/sorta the realm of bug 922581.
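
To illustrate what I mean (the "intermittent" keys below are made up for this example, not existing manifestparser syntax), the distinction could look something like:

    [test_foo.html]
    # Test can't work in this configuration: never run it.
    skip-if = os == "android"

    [test_bar.html]
    # Test is known intermittent: keep running it, but track it against a bug.
    intermittent = true
    intermittent-bug = 922581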

The harder solution is to have some service (like Orange Factor) keep track of the state of every test. We could then have a feedback loop whereby test automation queries that service to see which tests should run and what the expected result is. Of course, we will want that integration to work locally too, so test execution is consistent between automation and developer machines.
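
As a rough sketch of what the automation side of that feedback loop could look like (the service URL, endpoint, and JSON shape below are all invented for illustration; nothing like this exists yet):

    # Hypothetical sketch: the expectation service and its response format are made up.
    import json
    import urllib.request

    EXPECTATION_SERVICE = "https://example.org/test-expectations/api/v1"

    def fetch_expectations(suite, revision):
        """Ask the central service which tests to run and what result to expect."""
        url = "%s/%s?rev=%s" % (EXPECTATION_SERVICE, suite, revision)
        with urllib.request.urlopen(url) as response:
            # Invented response shape: {"test_foo.html": "PASS",
            #                           "test_bar.html": "INTERMITTENT", ...}
            return json.load(response)

    def should_run(test, expectations):
        # Run everything the service doesn't mark as disabled, including known
        # intermittents, so we keep collecting data on them.
        return expectations.get(test, "PASS") != "DISABLED"

The same queries would need to work (or be cached) locally so a developer's run matches what automation does.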

I see us inevitably deploying the harder solution. We'll eventually get to a point where we're able to do "crazy" things such as intelligently running only the subset of tests impacted by a check-in, or attempting to run disabled tests to see if they magically started working again. I think we'll eventually realize that tracking this in a central service makes more sense than doing it in-tree (mainly because of the amount of data required to make some of the more advanced determinations).

For the short term, I think we should enumerate the reasons we don't run a test (distinguishing between "test isn't compatible with this configuration" and "test isn't working" is important) and annotate these separately in our test manifests. We can then modify our test automation to treat each class differently. For example, we could:

1) Run failed tests multiple times. If a test is intermittent but not marked as such, we fail the test run.
2) Run marked intermittent tests multiple times. If the test passes all 25 times, fail the test run for inconsistent metadata (see the sketch after this list).
3) Integrate intermittent failures into TBPL/Orange Factor better.
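
As a sketch of how points 1 and 2 could be enforced in the harness (run_test() and the result strings are stand-ins for whatever the real harness provides, and 25 is just the arbitrary retry count from above):

    # Hypothetical retry policy; run_test() returns "PASS" or "FAIL" per run.
    RETRIES = 25

    def verify_test(test, marked_intermittent, run_test):
        results = [run_test(test) for _ in range(RETRIES)]
        failures = results.count("FAIL")

        if failures == RETRIES:
            return "PERMA-FAIL"
        if not marked_intermittent and failures > 0:
            # Point 1: test fails only sometimes but isn't marked intermittent.
            return "UNEXPECTED-INTERMITTENT"
        if marked_intermittent and failures == 0:
            # Point 2: marked intermittent but passed every time -> stale metadata.
            return "INCONSISTENT-METADATA"
        return "OK"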

To address David Baron's concern about silently passing intermittently failing tests: yes, silently passing is wrong. But I would argue it is the lesser of two evils, the other being disabling tests outright.

I think we can all agree that the current approach of disabling failing tests (the equivalent of sweeping dust under the rug) isn't sustainable. But if it's the sheriffs' job to keep the trees green and their only available recourse is to disable tests, well, they are going to disable tests. We need more metadata and tooling around disabled tests, and we needed it months ago.