Re: Project Stockwell (reducing intermittents) - March 2017 update

2017-03-23 Thread Kartikaya Gupta
On Thu, Mar 9, 2017 at 10:31 AM, Kartikaya Gupta  wrote:
> On Wed, Mar 8, 2017 at 6:02 PM, L. David Baron  wrote:
>> As of 5 days ago, "Treeherder Bug Filer" was not using BUG_COMPONENT
>> information.  I say this based on:
>> https://bugzilla.mozilla.org/show_bug.cgi?id=1344304
>> being filed in Core :: Layout despite:
>>   > $ ./mach file-info bugzilla-component 
>> layout/style/test/test_compute_data_with_start_struct.html
>>   > Core :: CSS Parsing and Computation
>>   >   layout/style/test/test_compute_data_with_start_struct.html
>>
>> (I wish it did use BUG_COMPONENT!  That's the main reason I bothered
>> to write good BUG_COMPONENT data for most of layout/*.)
>>
>
> Thanks for letting me know! I verified that indeed the bug filer
> doesn't check the BUG_COMPONENT. I'll look into fixing that.

FYI it looks like this has since been fixed by Wes, in bug 1347764.


Re: Project Stockwell (reducing intermittents) - March 2017 update

2017-03-09 Thread Karl Tomlinson
> It'd make me feel slightly less sad that we're disabling tests
> that do their job 90% of the time...

The way I interpret a test failing 10% of the time is that either
it has already done its job to indicate a problem in the product,
or the test is not doing its job.

Either way, if it is not going to be actively addressed, then the
value in continuing to run the test is questionable.


Re: Project Stockwell (reducing intermittents) - March 2017 update

2017-03-09 Thread James Graham

On 09/03/17 19:53, Milan Sreckovic wrote:

> Not a reply to this message, just continuing the thread.
>
> I'd like to see us run all the intermittently disabled tests once a ...
> week, say, or at some non-zero frequency, and automatically re-enable
> the tests that magically get better.  I have a feeling that some
> intermittent failures get fixed without us realizing it (e.g., we reduce
> memory usage, so OOMs go away, we speed things up so timeouts stop
> triggering) and it would be good to be able to re-enable those tests
> that start passing.
>
> It'd make me feel slightly less sad that we're disabling tests that do
> their job 90% of the time...


This idea is appealing, but there are some tricky details. Tests may 
work fine when run in isolation but either cause problems, or be 
problematic, when run in conjunction with related tests. Obviously it's 
possible to deal with that situation, but it might be the difference 
between "and now run this script that automatically enables all the 
tests that were stable over some number of repetitions" and "the 
results of this need careful manual analysis and are likely to result in 
tests that flip-flop between enabled and disabled a few times before 
people wise up to their peculiar brokenness".
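
A minimal sketch of the optimistic version of that script, assuming 
hypothetical run_test/enable_test helpers (a real version would hook into the 
harness and manifests, and per the above would also need to retest candidates 
alongside their related tests):

    # Naive re-enable sweep: run each disabled test in isolation N times
    # and re-enable it only if every repetition passes.
    N_REPETITIONS = 20

    def sweep(disabled_tests, run_test, enable_test):
        for test in disabled_tests:
            if all(run_test(test) for _ in range(N_REPETITIONS)):
                enable_test(test)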


I'm not saying that it's impossible, but I do doubt it's as trivial as 
it first appears.



Re: Project Stockwell (reducing intermittents) - March 2017 update

2017-03-09 Thread Milan Sreckovic

Not a reply to this message, just continuing the thread.

I'd like to see us run all the intermittently disabled tests once a ... 
week, say, or at some non-zero frequency, and automatically re-enable 
the tests that magically get better.  I have a feeling that some 
intermittent failures get fixed without us realizing it (e.g., we reduce 
memory usage, so OOMs go away, we speed things up so timeouts stop 
triggering) and it would be good to be able to re-enable those tests 
that start passing.


It'd make me feel slightly less sad that we're disabling tests that do 
their job 90% of the time...



On 09-Mar-17 14:04, jma...@mozilla.com wrote:
> [snip]


--
- Milan (mi...@mozilla.com)



Re: Project Stockwell (reducing intermittents) - March 2017 update

2017-03-09 Thread jmaher
A lot of great discussion here, thanks everyone for taking some time out of 
your day to weigh in on this subject.  There is a difference between a bug 
merely being filed and the bug being actively worked on once it crosses our 
threshold of 30 failures/week- here I want to discuss the point at which we 
have looked at the bug and have tried to add context/value, including a ni? 
request.

Let me comment on a few items here:
1) BUG_COMPONENT: I have been working this quarter to get this completed 
in-tree (https://bugzilla.mozilla.org/show_bug.cgi?id=1328351).  Ideally the 
sheriffs and Bug Filer tools will use this; we can work to fix that.  Part of 
this is ensuring there is an active triager responsible for each of those 
components, and that is mostly done: 
https://bugzilla.mozilla.org/page.cgi?id=triage_owners.html.

2) How do we get the right people to see the bugs?  We will always ni? the 
triage owner unless we know a better person to send the ni? request to.  In 
the many cases where we determine that a specific patch caused the regression, 
we will also ni? the patch author and cc the reviewer on the bug.  Please 
watch your components in Bugzilla and keep your Bugzilla handle updated when 
you are on PTO.

3) To the point of not clearing the ni? on a bug where we disable the test 
case: that is easy to do; let's assume that is standard protocol whenever we 
disable a test (or hack up the test case).

4) More granular whiteboard tags, and ones that don't use "stockwell".  We 
will figure out the right naming; most likely it will be extra tags to track 
when we fix a previously disabled test, as well as to differentiate between 
test fixes and product fixes.

5) When we triage a bug (the initial investigation after it crosses 30 
failures/week), we will include a brief report of the most-affected 
configuration along with the number of failures, the number of runs, and the 
failure rate.  This will be retrieved using |mach test-info | (bug 1345572 
for more info) and will look similar to this:
Total: 307 failures in 4313 runs or 0.071 failures/run
Worst rate on linux32/debug-e10s: 73 failures in 119 runs or 0.613 failures/run
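
As a rough illustration of how that report is computed (the per-configuration 
counts here are hypothetical; the real numbers come from the test-info 
tooling):

    # Overall and worst per-configuration failure rates, from hypothetical
    # (failures, runs) counts per configuration.
    counts = {
        "linux32/debug-e10s": (73, 119),
        "linux64/opt": (12, 1500),
        "windows7-32/debug": (222, 2694),
    }
    failures = sum(f for f, _ in counts.values())
    runs = sum(r for _, r in counts.values())
    print(f"Total: {failures} failures in {runs} runs "
          f"or {failures / runs:.3f} failures/run")
    worst = max(counts, key=lambda c: counts[c][0] / counts[c][1])
    f, r = counts[worst]
    print(f"Worst rate on {worst}: {f} failures in {r} runs "
          f"or {f / r:.3f} failures/run")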

6) Using a different metric/threshold for investigating a bug: we looked at 6 
months of data from 2016 to come up with this number.  Even assuming we fixed 
all of the high-frequency bugs, Orange Factor would still be 4.78 (as of 
Monday), which is still unacceptable; we are only interested in investigating 
tests that have the highest chance of getting fixed or that cause the most 
pain, not just whatever is in the top 10 or relatively high.  My goal is to 
adjust the threshold down to 20 in the future- that might not be as realistic 
as I would hope in the short term.

Keep in mind sheriffs are human: they make mistakes (filing bugs in the wrong 
place, ni?ing the wrong person, etc.), but they are also flexible and will 
work with you to get more information, to help manage a larger volume of 
failures, and to allow extra time if you are actively debugging the problem.

Thanks for the many encouraging comments in this thread and suggestions of how 
to work out the quirks with this new process.


Re: Project Stockwell (reducing intermittents) - March 2017 update

2017-03-09 Thread Kartikaya Gupta
On Wed, Mar 8, 2017 at 6:02 PM, L. David Baron  wrote:
> As of 5 days ago, "Treeherder Bug Filer" was not using BUG_COMPONENT
> information.  I say this based on:
> https://bugzilla.mozilla.org/show_bug.cgi?id=1344304
> being filed in Core :: Layout despite:
>   > $ ./mach file-info bugzilla-component 
> layout/style/test/test_compute_data_with_start_struct.html
>   > Core :: CSS Parsing and Computation
>   >   layout/style/test/test_compute_data_with_start_struct.html
>
> (I wish it did use BUG_COMPONENT!  That's the main reason I bothered
> to write good BUG_COMPONENT data for most of layout/*.)
>

Thanks for letting me know! I verified that indeed the bug filer
doesn't check the BUG_COMPONENT. I'll look into fixing that.

Cheers,
kats


Re: Project Stockwell (reducing intermittents) - March 2017 update

2017-03-08 Thread Karl Tomlinson
I would like to see failure rates expressed as a ratio of failures
to test runs, but I recognise that this data may not be readily
available and getting it may not be that important if we have a
rough idea.  These are a means for setting priorities, and so a
rank works well.

If we have 100 tests, each with an expected failure rate of 3%,
then the chance of a clear run is less than 5%.
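
A quick check of that arithmetic, assuming independent failures:

    # Probability that a run of 100 independent tests, each failing 3% of
    # the time, completes with no failures at all.
    p_clear = (1 - 0.03) ** 100
    print(f"{p_clear:.3f}")  # ~0.048, i.e. under a 5% chance of a clear run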

3% is not an acceptable failure rate for a single test IMO.

It does get harder to reproduce as the failure rate reduces, but
logging can be added and inspected on the next failure, which will
not be far away.


Re: Project Stockwell (reducing intermittents) - March 2017 update

2017-03-08 Thread L. David Baron
On Wednesday 2017-03-08 17:15 -0500, Kartikaya Gupta wrote:
> On Wed, Mar 8, 2017 at 4:01 PM,   wrote:
> > In the past I have not always been made aware when my tests were disabled, 
> > which has led to me feeling jaded.
> 
> We have a process (in theory) that ensures the relevant people get
> notified about their tests. The process involves these steps:
> 1) There is a moz.build file somewhere in the tree which covers the
> tests in question and specifies a BUG_COMPONENT for it
> 2) If a test starts failing intermittently, a bug is filed in the
> aforementioned component
> 3) The component is monitored/triaged regularly by the module owner or team
> 4) Whoever triages the bug notifies the individual owner if they're
> not already on the bug
> 
> So if for some reason anybody feels like they were not notified when
> their tests are being disabled, they should find the link in the above
> chain where things broke down (e.g. no BUG_COMPONENT for a test, bug
> filed in a component other than BUG_COMPONENT, nobody triaging new
> bugs, etc.) and do something about it.

As of 5 days ago, "Treeherder Bug Filer" was not using BUG_COMPONENT
information.  I say this based on:
https://bugzilla.mozilla.org/show_bug.cgi?id=1344304
being filed in Core :: Layout despite:
  > $ ./mach file-info bugzilla-component 
layout/style/test/test_compute_data_with_start_struct.html 
  > Core :: CSS Parsing and Computation
  >   layout/style/test/test_compute_data_with_start_struct.html

(I wish it did use BUG_COMPONENT!  That's the main reason I bothered
to write good BUG_COMPONENT data for most of layout/*.)

> >> and it would be reasonable to expect a fix.
> >
> > I think it's unreasonable to assume that developers can drop whatever 
> > they're doing and turn around a fix in two weeks, given how long these 
> > things often take to fix, and given that developers often have a 
> > pre-existing list of other high priority stuff to work on.
> >
> 
> In my experience it's not so much that a fix is needed in two weeks,
> it's that you need to put in a good-faith effort to respond and start
> investigation. Oftentimes it legitimately takes longer than two weeks
> to fix intermittents, but I've never had a scenario where I asked for
> more time and was denied that.

I think it's often reasonable to expect a *backout* of the cause
within less than two weeks, although perhaps not if the immediate
trigger was changes in test chunking.

-David

-- 
𝄞   L. David Baron http://dbaron.org/   𝄂
𝄢   Mozilla  https://www.mozilla.org/   𝄂
 Before I built a wall I'd ask to know
 What I was walling in or walling out,
 And to whom I was like to give offense.
   - Robert Frost, Mending Wall (1914)




Re: Project Stockwell (reducing intermittents) - March 2017 update

2017-03-08 Thread Kartikaya Gupta
On Wed, Mar 8, 2017 at 4:01 PM,   wrote:
> In the past I have not always been made aware when my tests were disabled, 
> which has led to me feeling jaded.

We have a process (in theory) that ensures the relevant people get
notified about their tests. The process involves these steps:
1) There is a moz.build file somewhere in the tree which covers the
tests in question and specifies a BUG_COMPONENT for it
2) If a test starts failing intermittently, a bug is filed in the
aforementioned component
3) The component is monitored/triaged regularly by the module owner or team
4) Whoever triages the bug notifies the individual owner if they're
not already on the bug

So if for some reason anybody feels like they were not notified when
their tests are being disabled, they should find the link in the above
chain where things broke down (e.g. no BUG_COMPONENT for a test, bug
filed in a component other than BUG_COMPONENT, nobody triaging new
bugs, etc.) and do something about it.
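
For reference, the step-1 annotation in a moz.build file looks something like 
this (the path pattern and component here are illustrative):

    # moz.build: map everything under this directory to a Bugzilla
    # product/component pair.
    with Files("**"):
        BUG_COMPONENT = ("Core", "CSS Parsing and Computation")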

>
>> and it would be reasonable to expect a fix.
>
> I think it's unreasonable to assume that developers can drop whatever they're 
> doing and turn around a fix in two weeks, given how long these things often 
> take to fix, and given that developers often have a pre-existing list of 
> other high priority stuff to work on.
>

In my experience it's not so much that a fix is needed in two weeks,
it's that you need to put in a good-faith effort to respond and start
investigation. Oftentimes it legitimately takes longer than two weeks
to fix intermittents, but I've never had a scenario where I asked for
more time and was denied that.

>
> I think:
> * Acceptable failure rates expressed as an absolute number aren't 
> meaningful; we should be expressing acceptable rates as a percentage.

On the one hand, I agree with your rationale here. On the other hand,
the people who have to deal with the problem (sheriffs) have to deal
with it linearly, i.e. 60 failures is twice as much work for them as
30 failures, regardless of the number of pushes. So from their
perspective an absolute number does make sense. I don't have a strong
opinion on this either way, just pointing out the other side.

Cheers,
kats


Re: Project Stockwell (reducing intermittents) - March 2017 update

2017-03-08 Thread chris . ryan . pearce
On Wednesday, March 8, 2017 at 11:18:03 PM UTC+13, jma...@mozilla.com wrote:
> On Tuesday, March 7, 2017 at 11:45:38 PM UTC-5, Chris Pearce wrote:
> > [snip]
> 
> Thanks cpearce for the concern here.  Regarding disabling tests, all tests 
> that we have disabled as part of the stockwell project have started out with 
> a triage where we ni the responsible party and the bug is filed in the 
> component where the test is associated with.  I assume if the bug is filed in 
> the right component others from the team will be made aware of this.  Right 
> now I assume the triage owner of a component is the owner of the tests and 
> can proxy the request to the correct person on the team (many times the 
> original author is on PTO, busy with a project, left the team, etc.).  Please 
> let me know if this is a false assumption and what we could do to better get 
> bugs in front of the right people.


In the past I have not always been made aware when my tests were disabled, 
which has led to me feeling jaded.


> I agree 8% is a good number, the sheriff policy has other criteria (top 20 on 
> orange factor, 100 times/month).

Ok. Let's assume 8% is a reasonable threshold then...


> We picked 30 times/week as that is where bugs start becoming frequent enough 
> to easily reproduce (locally or on try)

I disagree.

I often find that oranges I get pinged on are in fact not easy to reproduce, 
and it takes a few weeks of elapsed time to solve them, due to their typically 
reproducing only on Try and my needing to work on other high-priority bugs 
concurrently.  That means there's a context-switch overhead too, as I balance 
everything I'm working on.


> and it would be reasonable to expect a fix.

I think it's unreasonable to assume that developers can drop whatever they're 
doing and turn around a fix in two weeks, given how long these things often 
take to fix, and given that developers often have a pre-existing list of other 
high priority stuff to work on.


>  There is ambiguity when using a %, on a low volume week (as most of december 
> was) we see <500 pushes/week, also the % doesn't indicate the amount of times 
> the test was run- this is affected by SETA (reducing tests for 4/5 commits to 
> save on load) and by people doing retriggers/backfills.  If last week the 
> test was 8%, and it is 7% this week- do we ignore it?
> 
> Picking a single number like 30 times/7days removes ambiguity and ensures 
> that we can stay focused on things and don't have to worry about 
> recalculations.  It is true on lower volume weeks that 30 times/7days doesn't 
> happen as frequently, yet we have always had many bugs to work on with that 
> threshold.

It sounds like the problem actually is that we haven't taken the time to 
implement good data collection.

Given your example of fewer than 500 pushes/week in December, 30 failures in 
500 pushes is a 6% failure rate, well below the 8% rate the sheriffs are 
beholden to enforce.

So given that you say that an 8% threshold is reasonable, 30 failures/week is 
already too low a threshold.

If we saw 30 failures in 1000 pushes/week, that would be a 3% failure rate, 
but by your reasoning that would be considered worthy of investigation.

I don't think it's reasonable to consider a test failing 3% of the time as 
worthy of investigation.
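
To spell out that arithmetic (illustrative numbers from this thread):

    # The same absolute weekly threshold maps to different failure rates
    # depending on that week's push volume.
    THRESHOLD = 30  # failures/week that triggers investigation
    for pushes_per_week in (500, 1000):
        print(f"{THRESHOLD} failures in {pushes_per_week} pushes = "
              f"{THRESHOLD / pushes_per_week:.0%}")
    # -> 6% on a quiet week, 3% on a busy one: as a percentage, the
    #    absolute threshold gets stricter the more pushes there are.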

To me, it feels like we're setting ourselves up for creating unnecessary 
crises, and unnecessary tension between the stockwell people and the 
developers. 

I think:
* Acceptable failure rates expressed as an absolute number aren't 
meaningful; we should be expressing acceptable rates as a percentage.
* Two weeks is simply an unreasonable amount of time in which to expect a fix 
for an intermittent.




Re: Project Stockwell (reducing intermittents) - March 2017 update

2017-03-08 Thread Marco Bonardo
On Wed, Mar 8, 2017 at 2:12 PM, Kartikaya Gupta  wrote:

> What makes me sad is all the
> developers in this thread trying to push back against disabling of
> clearly problematic tests, asking for things like tracking bugs and
> needinfos, when the reality is that the developers just don't have the
> time to deal with new bugs or needinfos.


I sort of disagree with the sentiment. We're all well aware of the great
job sheriffs and people like Joel are doing. I also did sheriffing in my
spare time for quite a while in the past, even outside work hours (on
Saturdays and Sundays), so I partially know what I'm talking about. I'd
thank them every minute.


> I think we have a structural problem where developers are insulated
> from the actual fallout of bad tests.


That was indeed one of the reasons to introduce dedicated sheriffs and fast
backouts, so developers can spend less time looking at the tree and more
time coding.


> But I feel fundamentally the problem is that
> developers have no real incentive (except for "pride") to fix
> intermittents. Even disabling tests is not really an incentive, as
> there is no negative effect to the developer when it happens.


I feel quite bad if one of the tests in the modules I own gets disabled;
that has quite a negative effect on me. But I agree there is no incentive
to fix intermittents, or rather, there are no dedicated resources for
that (more on this later).


> The
> naive solution to aligning incentives is to make more developers
> responsible for starring failures.


We know this didn't work already.


> Another potential solution is to block developers until they fix their
> tests.


The first problem I see is figuring out who "owns" the failing test. I
have tests in the modules I own that have some intermittent failures (not
so frequent, luckily), but those modules have no team to fix them. Should I
be blocked until I fix all the intermittent tests in my modules? Who should
be? How do I find developers with time to help me fix those failures? I
have no idea.
It also sounds like an expensive trade-off to suggest to management: if a
person is working on a critical project for the quarter, blocking them to
work on an intermittent that may be tricky to solve and take days will
have a huge cost.

Btw, my opinion here is that the situation will never improve until there's
a general recognition that intermittents have a cost and thus teams should
dedicate part of their time to them. So far the burden is put on
individuals, who try to volunteer time between planned projects to fix
tests they know something about. But no team, afaik, has dedicated
planning for these issues.
Why is triaging and prioritizing intermittent-failure fixes not
officially part of every team's weekly planning?


Re: Project Stockwell (reducing intermittents) - March 2017 update

2017-03-08 Thread Kartikaya Gupta
I've been reading this thread with much sadness, but refraining from
commenting because I have nothing good to say. But I feel like I
should probably comment regardless. What makes me sad is all the
developers in this thread trying to push back against disabling of
clearly problematic tests, asking for things like tracking bugs and
needinfos, when the reality is that the developers just don't have the
time to deal with new bugs or needinfos. After all, the test was
disabled only *after* bugs were filed and people were needinfo'd with
no response. In my experience, Joel and the sheriffing team put in a
lot of hard work starring these things and filing bugs, only to have a
lot of that work ignored. Except of course when it comes to threads
like this, when everybody has an opinion, mostly asking for even more
of them. (I am aware that many developers do actually investigate and
fix intermittents, and that set of developers is orthogonal to the set
of developers pushing back in this thread, but it still disheartens
me).

I think we have a structural problem where developers are insulated
from the actual fallout of bad tests. They're not the ones who have to
deal with starring failures. For a while now I've been doing this work
on the graphics branch, and it's not fun at all. There are so many
failures (and variations of failures) on a typical push that it
becomes very easy to "zone out" and just pick a sort-of-matching
existing failure and star a new failure with it. This leads to
mis-starring of stuff and missing new legitimate failures. I'm sure
the sheriffs do a better job than I do, but even so the trend is clear
to me: more unsolved intermittents will result in more bad data across
the board, and more missed legitimate failures. (Think of it like
fixing warnings in code - if you already have reams of warnings you're
easily going to miss a new one being added).

Sure, there's tooling we can improve. And sure, we can tweak the way
we process these things. And there are some good suggestions upthread
that we should do. But I feel fundamentally the problem is that
developers have no real incentive (except for "pride") to fix
intermittents. Even disabling tests is not really an incentive, as
there is no negative effect to the developer when it happens. The
naive solution to aligning incentives is to make more developers
responsible for starring failures. I believe this was done in the
past, but I don't really want to go back there. For one thing, we
would lose a valuable across-the-board view of common types of similar
errors, and sheriffs do a lot of non-obvious starring as well
(sometimes the relevant error message doesn't show up in the TH view)
or recall bugs from memory which is hard for other people to do.

Another potential solution is to block developers until they fix their
tests. This is actually not as crazy as it sounds. After all, when
landing code, reviewers can (and should) block developers from landing
until they have tests that go with the code. So why is it that the bar
is lowered later, and the test is allowed to be disabled without
backing out or disabling the corresponding code? (Rhetorical
question). It's not easy to back out or disable code, but it's much
easier to prevent a developer from landing new code if they have a
backlog of intermittents or disabled tests. But again, this is also a
bad solution for various reasons.

I've been around long enough to realize that there's no real good
solution here. And that makes me sad. But I also do want to really
thank Joel and the sheriffs and anybody else who has to deal with
pointy end of this problem for doing the great job that they do, given
the circumstances. And I'd like to encourage developers to try and
take more responsibility for intermittent failures, even in the
absence of adequate tooling and STR and all that other good stuff. At
the very least respond to your needinfos.

Cheers,
kats


On Wed, Mar 8, 2017 at 5:18 AM,   wrote:
> [snip]

Re: Project Stockwell (reducing intermittents) - March 2017 update

2017-03-08 Thread jmaher
On Tuesday, March 7, 2017 at 11:45:38 PM UTC-5, Chris Pearce wrote:
> [snip]

Thanks cpearce for the concern here.  Regarding disabling tests, all tests that 
we have disabled as part of the stockwell project have started out with a 
triage where we ni? the responsible party and the bug is filed in the component 
the test is associated with.  I assume if the bug is filed in the right 
component others from the team will be made aware of this.  Right now I assume 
the triage owner of a component is the owner of the tests and can proxy the 
request to the correct person on the team (many times the original author is on 
PTO, busy with a project, left the team, etc.).  Please let me know if this is 
a false assumption and what we could do to better get bugs in front of the 
right people.

I agree 8% is a good number; the sheriff policy has other criteria (top 20 on 
Orange Factor, 100 times/month).  We picked 30 times/week as that is where bugs 
start becoming frequent enough to reproduce easily (locally or on try) and 
where it would be reasonable to expect a fix.  There is ambiguity when using a 
%: on a low-volume week (as most of December was) we see <500 pushes/week, and 
the % doesn't indicate the number of times the test was run- this is affected 
by SETA (reducing tests on 4 of 5 commits to save on load) and by people doing 
retriggers/backfills.  If last week the test was at 8%, and it is at 7% this 
week- do we ignore it?

Picking a single number like 30 times/7days removes ambiguity and ensures that 
we can stay focused on things and don't have to worry about recalculations.  It 
is true on lower volume weeks that 30 times/7days doesn't happen as frequently, 
yet we have always had many bugs to work on with that threshold.


Re: Project Stockwell (reducing intermittents) - March 2017 update

2017-03-07 Thread Chris Pearce
I recommend that, instead of classifying intermittents as tests which fail > 30 
times per week, we instead classify tests that fail more than some threshold 
percentage as intermittent. Otherwise, on a week with lots of checkins, a test 
which isn't actually a problem could clear the threshold and cause unnecessary 
work for orange triage people and developers alike.

The currently published threshold is 8%:

https://wiki.mozilla.org/Sheriffing/Test_Disabling_Policy#Identifying_problematic_tests

8% seems reasonable to me.

Also, whenever a test is disabled, not only should a bug be filed, but please 
_please_ need-info the test owner or at least someone on the affected team.

If a test for a feature is disabled without the maintainer of that feature 
knowing, then we are flying blind and we are putting the quality of our product 
at risk.


cpearce.



On Wednesday, March 8, 2017 at 9:06:46 AM UTC+13, jma...@mozilla.com wrote:
> [snip]



Re: Project Stockwell (reducing intermittents) - March 2017 update

2017-03-07 Thread jmaher
On Tuesday, March 7, 2017 at 2:57:21 PM UTC-5, Steve Fink wrote:
> On 03/07/2017 11:34 AM, Joel Maher wrote:
> > Good suggestion here- I have seen so many cases where a simple
> > fix/disabled/unknown/needswork just do not describe it.  Let me work on a
> > few new tags given that we have 248 bugs to date.
> >
> > I am thinking maybe [stockwell turnedoff] - where the job is turned off- we
> > could also ensure one of the last comments indicates this.
> >
> > also [stockwell fix] -> [stockwell testfix], [stockwell bandaid] (for those
> > requestLongerTimeouts(), etc.), [stockwell productfix], and [stockwell
> > reenabled].
> 
> Forgive the bikeshedding, but my kneejerk reaction to these is to wonder 
> whether it's a good idea to use the "stockwell" jargon. It would be a 
> lot easier for people unfamiliar with the stockwell project if these 
> were [intermittent turnedoff], [intermittent fix], etc. Perhaps it's too 
> late, but is that a possibility?

I think that is valid, thanks for bringing that up!


Re: Project Stockwell (reducing intermittents) - March 2017 update

2017-03-07 Thread Kris Maglione

On Tue, Mar 07, 2017 at 11:15:54AM -0800, Kris Maglione wrote:
> It would be nice if, rather than disabling the test, we could just
> annotate it so that it would still run, and show up in Orange Factor, but
> wouldn't turn the job orange.


Which might be as simple as moving those jobs into a particular 
subsuite, and running that subsuite as tier 2 or 3.



Re: Project Stockwell (reducing intermittents) - March 2017 update

2017-03-07 Thread Joel Maher
Good suggestion here- I have seen so many cases where a simple
fix/disabled/unknown/needswork just do not describe it.  Let me work on a
few new tags given that we have 248 bugs to date.

I am thinking maybe [stockwell turnedoff] - where the job is turned off- we
could also ensure one of the last comments indicates this.

also [stockwell fix] -> [stockwell testfix], [stockwell bandaid] (for those
requestLongerTimeouts(), etc.), [stockwell productfix], and [stockwell
reenabled].

We have [stockwell unknown] for tests that stop failing as frequently; this
could become: [stockwell reduced], [stockwell wrongbug], [stockwell unknown] <-
for real unknowns

Just tracking what we disable vs fix is a big step towards keeping track of
this problem.  I know there are other tests which are disabled outside of
the view of stockwell, but we catch most of them.  In fact, there have been
cases where we waited a few weeks, finally had a patch r+'d to disable,
waited one more day, and in that day saw a fix land (including one today!).
I would like to think we are patient.  Making sure that we consider a longer
timeout is the type of feedback I want to hear- possibly there are other
things we could do to help narrow down the problem or bandaid the test
along.

-Joel


On Tue, Mar 7, 2017 at 2:23 PM, Marco Bonardo  wrote:
> [snip]
>


Re: Project Stockwell (reducing intermittents) - March 2017 update

2017-03-07 Thread Marco Bonardo
On Tue, Mar 7, 2017 at 8:11 PM,  wrote:

> Thanks for checking up on this- there are 6 specific bugs that have this
> signature in the disabled set- in this case they are all linux32-debug
> devtools tests- we disabled devtools on linux32-debug because the runtime
> was exceeding in many cases 90 seconds for a test
>

Disabling on a single platform sounds OK, but how can I tell that from just
the [stockwell disabled] annotation? It may be important to distinguish the
case where a test is globally disabled from the case where it's only disabled
on a single platform.
Maybe the whiteboard annotation could be expanded a little bit, so we don't
have to read the whole bug to figure that out.
Btw, I didn't want to sound accusing; I'm very glad and happy you are
looking into this.


Re: Project Stockwell (reducing intermittents) - March 2017 update

2017-03-07 Thread jmaher
On Tuesday, March 7, 2017 at 1:59:14 PM UTC-5, Steve Fink wrote:
> [snip]

I am happy to see the discussion here.  Overall, we do not have data to 
indicate whether we are fixing a bug in the product or patching a test.  I 
agree we should track that, and I will try to do so going forward.  I recall 1 
case of that happening this quarter; I suspect there are others.

Most of the disabled tests are on bugs marked leave-open and have the relevant 
developers on the bug- what value would a new bug bring?  If it would be 
better, I am happy to create a new bug.

I have seen 1 bug get fixed after being disabled, but that is it for this 
quarter.  Possibly there are others, but it is hard to know.  If we followed 
the tree rules for visibility, many of the jobs would be hidden and we would 
get no value from them.

I think running the tests that are disabled on try once in a while seems 
useful; one could argue that is the role of the people who own the tests- 
possibly we could make it easier to do this?  I could see adding a |tag = 
disabled| annotation and running the disabled tests x20 or so in a nightly 
M(d) job to indicate it is disabled.  If we were to do that, who would look at 
the results, and how would we get that information to all of the teams who 
care about the tests?


Re: Project Stockwell (reducing intermittents) - March 2017 update

2017-03-07 Thread Kris Maglione

On Tue, Mar 07, 2017 at 06:26:28AM -0800, jma...@mozilla.com wrote:
> In March, we want to find a way to disable the tests that are causing
> the most pain or are most likely not to be fixed, without unduly
> jeopardizing the chance that these bugs will be fixed.  We propose:
>
> 1) all high frequency (>=30/week) intermittent failure bugs will have
> 2 weeks from initial triage to get fixed, otherwise we will disable
> the test case.
>
> 2) all very high frequency bugs (>=75/week) will have 1 week from
> initial triage to get fixed, otherwise we will disable the test case.

It would be nice if, rather than disabling the test, we could just
annotate it so that it would still run, and show up in Orange Factor, but
wouldn't turn the job orange. And make sure someone is CCed on the bug
to get the daily/weekly nag emails.



Re: Project Stockwell (reducing intermittents) - March 2017 update

2017-03-07 Thread jmaher
On Tuesday, March 7, 2017 at 1:53:48 PM UTC-5, Marco Bonardo wrote:
> On Tue, Mar 7, 2017 at 6:42 PM, Joel Maher  wrote:
> 
> > Thank for pointing that out.  In some cases we have fixed tests that are
> > just timing out, in a few cases we disable because the test typically runs
> > much faster (i.e. <15 seconds) and is hanging/timing out.  In other cases
> > extending the timeout doesn't help (i.e. a hang/timeout).
> >
> 
> Any failure like "This test exceeded the timeout threshold. It should be
> rewritten or split up. If that's not possible, use requestLongerTimeout(N),
> but only as a last resort" is not really a failure nor a timeout.
> For these cases extending the timeout will 100% solve the failure, but it
> can't be considered a long term fix since we should not have single tests
> so complex to take minutes to run.

Thanks for checking up on this.  There are 6 specific bugs that have this 
signature in the disabled set; in this case they are all linux32-debug 
devtools tests.  We disabled devtools on linux32-debug because the runtime was 
in many cases exceeding 90 seconds per test (that is after adding 
requestLongerTimeout) where on linux64-debug it was <30 seconds.  There were 
almost no users of devtools on linux32-debug, so we found it easier to disable 
the entire suite of tests to save runtime, sheriff time, etc.  This was done 
in bug 1328915.

I would encourage you to look through the [stockwell fixed] whiteboard tag and 
see many examples of great fixes by many developers.


Re: Project Stockwell (reducing intermittents) - March 2017 update

2017-03-07 Thread Boris Zbarsky

On 3/7/17 1:33 PM, Honza Bambas wrote:
> I presume that when a test is disabled a bug is filed

As far as I can tell, that's not the case...

If that were the case, that would be a good start, yes.

-Boris


Re: Project Stockwell (reducing intermittents) - March 2017 update

2017-03-07 Thread Steve Fink
Is there a mechanism in place to detect when disabled intermittent tests 
have been fixed?


eg, every so often you could rerun disabled tests individually a bunch 
of times. Or if you can distinguish which tests are failing, run them 
all a bunch of times and pick apart the wreckage to see which ones are 
now consistently passing. I'm not suggesting those, just using them as 
example solutions to illustrate what I mean.


On 03/07/2017 10:33 AM, Honza Bambas wrote:
> [snip]


Re: Project Stockwell (reducing intermittents) - March 2017 update

2017-03-07 Thread Marco Bonardo
On Tue, Mar 7, 2017 at 6:42 PM, Joel Maher  wrote:

> Thank for pointing that out.  In some cases we have fixed tests that are
> just timing out, in a few cases we disable because the test typically runs
> much faster (i.e. <15 seconds) and is hanging/timing out.  In other cases
> extending the timeout doesn't help (i.e. a hang/timeout).
>

Any failure like "This test exceeded the timeout threshold. It should be
rewritten or split up. If that's not possible, use requestLongerTimeout(N),
but only as a last resort" is not really a failure nor a timeout.
For these cases extending the timeout will 100% solve the failure, but it
can't be considered a long-term fix, since we should not have single tests
so complex that they take minutes to run.


Re: Project Stockwell (reducing intermittents) - March 2017 update

2017-03-07 Thread Honza Bambas
I presume that when a test is disabled, a bug is filed and triaged within 
the responsible team like any regular bug.  Only that way do we not forget 
to push on fixing it and returning it to the wheel.


Are there also some data or stats on how often tests with a strong orange 
factor catch actual regressions?  I.e., fail in a different way than the 
filed "intermittent" one and uncover a real bug, leading to a patch backout 
or the filing of a regular functionality regression bug.  If that number 
is found to be high(ish) for a test, the priority of fixing it after it is 
disabled should be raised.


-hb-


On 07/03/2017 17:10, jma...@mozilla.com wrote:
> On Tuesday, March 7, 2017 at 10:37:12 AM UTC-5, Boris Zbarsky wrote:
> > On 3/7/17 9:26 AM, jma...@mozilla.com wrote:
> > > We find that we are fixing 35% of the bugs and disabling 23% of them.
> >
> > Is there a credible plan for reenabling the ones we disable?
> >
> > -Boris
>
> Great question, we do not have a credible plan, but we will have a quick way
> to see what is disabled:
> https://goo.gl/EfDKxY
>
> I would like to build a dashboard when possible to outline per component
> which test cases exist, which are disabled, and what related intermittent
> bugs there are.  Part of annotating moz.build files with BUG_COMPONENTS is
> to make it easier to associate all test cases with components, which would
> help with a dashboard.
>
> Do you have suggestions for how to ensure we keep up with the disabled tests?





Re: Project Stockwell (reducing intermittents) - March 2017 update

2017-03-07 Thread Joel Maher
Thanks for pointing that out.  In some cases we have fixed tests that are
just timing out; in a few cases we disable because the test typically runs
much faster (i.e. <15 seconds) and is hanging/timing out.  In other cases
extending the timeout doesn't help (i.e. a hang/timeout).

Please feel free to comment on any of the disabled bugs, I am happy to take
another look!

On Tue, Mar 7, 2017 at 12:38 PM, Marco Bonardo  wrote:
> [snip]


Re: Project Stockwell (reducing intermittents) - March 2017 update

2017-03-07 Thread Marco Bonardo
On Tue, Mar 7, 2017 at 6:34 PM, Marco Bonardo
>
> In case of mochitest browser tests failing on "This test exceeded the
> timeout threshold", the temporary solution after 1 or 2 weeks should be to
> add requestLongerTimeout, rather than disabling them. They should still be
> split up into smaller tests, but it doesn't make sense to disable them since
> they can complete properly.
>

And indeed, on the list you provided I see that 5 of the tests are wrongly
disabled: they are not failing, they just take too much time to complete.
They should be "adjusted" but still enabled, and tracked by your project as
"flaky" or whatever.


Re: Project Stockwell (reducing intermittents) - March 2017 update

2017-03-07 Thread Marco Bonardo
On Tue, Mar 7, 2017 at 3:26 PM,  wrote:

> In recent months we have been triaging high frequency (>=30 times/week)
> failures in automated tests.  We find that we are fixing 35% of the bugs
> and disabling 23% of them.
>

In case of mochitest browser tests failing on "This test exceeded the
timeout threshold", the temporary solution after 1 or 2 weeks should be to
add requestLongerTimeout, rather than disabling them. They should still be
split up into smaller tests, but it doesn't make sense to disable them since
they can complete properly.


Re: Project Stockwell (reducing intermittents) - March 2017 update

2017-03-07 Thread Boris Zbarsky

On 3/7/17 9:26 AM, jma...@mozilla.com wrote:
> We find that we are fixing 35% of the bugs and disabling 23% of them.

Is there a credible plan for reenabling the ones we disable?

-Boris


Project Stockwell (reducing intermittents) - March 2017 update

2017-03-07 Thread jmaher
In recent months we have been triaging high frequency (>=30 times/week) 
failures in automated tests.  We find that we are fixing 35% of the bugs and 
disabling 23% of them.

The great news is we are fixing many of the issues.  The sad news is we are 
disabling tests, but usually only after giving a bug 2+ weeks of time to get 
fixed.

In March, we want to find a way to disable the tests that are causing the most 
pain or are most likely not to be fixed, without unduly jeopardizing the chance 
that these bugs will be fixed.  We propose:
1) all high frequency (>=30/week) intermittent failure bugs will have 2 weeks 
from initial triage to get fixed, otherwise we will disable the test case.
2) all very high frequency bugs (>=75/week) will have 1 week from initial 
triage to get fixed, otherwise we will disable the test case.
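
In code form, the two rules above amount to something like this sketch:

    # Weeks from initial triage until the test is disabled, as a function
    # of weekly failure count (None = below the triage threshold).
    def weeks_until_disable(failures_per_week):
        if failures_per_week >= 75:   # very high frequency
            return 1
        if failures_per_week >= 30:   # high frequency
            return 2
        return None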

We still plan to only pester once/week.  If a test has fallen out of our 
definition of high frequency, we are happy to make a note in the bug and adjust 
expectations.

Since we are changing this, we expect a few more disabled tests, but do not 
expect this to shift the balance of fixed vs disabled.  We also expect our Orange 
Factor to be <7.0 by the end of the month.

Thanks to everyone for their on-going efforts to fix frequent intermittent test 
failures; together we can make test results more reliable and less confusing 
for everyone.

Here is a blog post with more data and information about the project:
https://elvis314.wordpress.com/2017/03/07/project-stockwell-reduce-intermittents-march-2017/