On 2014-04-08, 8:15 AM, Aryeh Gregor wrote:
On Tue, Apr 8, 2014 at 2:41 AM, Ehsan Akhgari <ehsan.akhg...@gmail.com> wrote:
What you're saying above is true *if* someone investigates the intermittent
test failure and determines that the bug is not important.  But in my
experience, that's not what happens at all.  I think many people treat
intermittent test failures as a category of unimportant problems, and
therefore some bugs are never investigated.  The fact of the matter is that
most of these bugs are bugs in our tests, which of course will not impact
our users directly, but I have occasionally come across bugs in our code
which are exposed as intermittent failures.  The real issue is that the
work of identifying the root of the problem is often the majority of the
work needed to fix the intermittent test failure, so unless someone is
willing to investigate the bug we cannot say whether or not it impacts
our users.

The same is true for many bugs.  The reported symptom might indicate a
much more extensive underlying problem.  The fact is, though,
thoroughly investigating every bug would take a ton of resources, and
is almost certainly not the best use of our manpower.  There are many
bugs that are *known* to affect many users that don't get fixed in a
timely fashion.  Things that will probably never affect a single user,
and which are likely to be a pain to track down (because they're
intermittent), should be prioritized relatively low.

I don't think an analogy with normal bugs is accurate here.  In my experience, these intermittent failure bugs are treated categorically differently from all other incoming bugs.

The thing that really makes me care about these intermittent failures
is that ultimately they force us to choose between disabling a whole bunch
of tests and being unable to manage our tree.  As more and more tests get
disabled, we lose more and more test coverage, and that can have a much more
severe impact on the health of our products than any individual
intermittent test failure.

I think you hit the nail on the head, but I think there's a third
solution: automatically ignore known intermittent failures, in as
fine-grained a way as possible.  This means the test is still almost
as useful -- I think the vast majority of our tests will fail
consistently if the thing they're testing breaks, not fail
intermittently.  But it doesn't get in the way of managing the tree.
Yes, it reduces some tests' value slightly relative to fixing them,
but it's not a good use of our resources to try tracking down most
intermittent failures.  The status quo reduces those tests' value just
as much as automatic ignoring (because people will star known failure
patterns consistently), but imposes a large manual labor cost.

I agree that automatically ignoring known intermittent failures that have been marked as such by a human is a good idea.
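To make the "fine-grained" part concrete, here is a rough sketch of the kind of matching I have in mind; the file name, JSON format, and function names are made up for illustration and aren't anything our harness actually implements:

    import fnmatch
    import json

    def load_known_intermittents(path="known-intermittents.json"):
        # Hypothetical format: one entry per human-filed annotation, e.g.
        # {"test": "dom/tests/test_foo.html",
        #  "message": "*timed out waiting for event*",
        #  "bug": 123456}
        with open(path) as f:
            return json.load(f)

    def classify_failure(test, message, known):
        # Match on both the test and the failure message, so an unrelated
        # failure in the same test still turns the run orange.
        for entry in known:
            if entry["test"] == test and fnmatch.fnmatch(message, entry["message"]):
                return "KNOWN-INTERMITTENT"
        return "FAIL"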

But let's also not forget that it won't be a one-size-fits-all solution.  There are test failure scenarios, such as timeouts and crashes, which we can't easily retry (timeouts because the test may leave its environment in a non-clean state).  There will also be cases where re-running the same test actually tests something different (because, for example, things have been cached, etc.).
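To spell out the retry caveat a bit, something like the following is probably the most a harness could safely do; the Result type, the status strings, and the run_test callable are hypothetical stand-ins:

    from collections import namedtuple

    # Hypothetical result type; real harnesses obviously carry much more.
    Result = namedtuple("Result", ["status", "message"])

    MAX_RETRIES = 1

    def run_with_retry(test, run_test):
        # run_test is a callable taking a test id and returning a Result.
        result = run_test(test)
        # Timeouts may leave the environment in a non-clean state, and a crash
        # takes the whole run down with it, so neither is safe to retry in place.
        if result.status in ("TIMEOUT", "CRASH"):
            return result
        retries = 0
        while result.status == "FAIL" and retries < MAX_RETRIES:
            # A re-run is also not a pristine run: caches, profiles, etc. may
            # already be warm, so a pass on retry exercises slightly different
            # conditions than the first attempt did.
            retries += 1
            result = run_test(test)
        return result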

Cheers,
Ehsan
