I've been reading this thread with much sadness, but refraining from
commenting because I have nothing good to say. But I feel like I
should probably comment regardless. What makes me sad is all the
developers in this thread pushing back against the disabling of
clearly problematic tests, asking for things like tracking bugs and
needinfos, when the reality is that the developers just don't have the
time to deal with new bugs or needinfos. After all, the test was
disabled only *after* bugs were filed and people were needinfo'd with
no response. In my experience, Joel and the sheriffing team put in a
lot of hard work starring these things and filing bugs, only to have
much of that work ignored. Except, of course, in threads like this,
where everybody has an opinion, mostly asking for even more of that
work. (I am aware that many developers do actually investigate and fix
intermittents, and that set of developers is largely disjoint from the
set of developers pushing back in this thread, but it still
disheartens me.)

I think we have a structural problem where developers are insulated
from the actual fallout of bad tests. They're not the ones who have to
deal with starring failures. For a while now I've been doing this work
on the graphics branch, and it's not fun at all. There are so many
failures (and variations of failures) on a typical push that it
becomes very easy to "zone out" and just pick a sort-of-matching
existing failure and star the new failure against it. This leads to
mis-starring and to missing legitimate new failures. I'm sure the
sheriffs do a better job than I do, but even so the trend is clear to
me: more unsolved intermittents will result in more bad data across
the board, and more missed legitimate failures. (Think of it like
fixing warnings in code: if you already have reams of warnings, it's
easy to miss a new one being added.)

Sure, there's tooling we can improve. And sure, we can tweak the way
we process these things. And there are some good suggestions upthread
that we should act on. But fundamentally, I feel the problem is that
developers have no real incentive (except "pride") to fix
intermittents. Even disabling tests is not really an incentive, as
there is no negative effect on the developer when it happens. The
naive solution to aligning incentives is to make more developers
responsible for starring failures. I believe this was done in the
past, but I don't really want to go back there. For one thing, we
would lose a valuable across-the-board view of common types of similar
errors; sheriffs also do a lot of non-obvious starring (sometimes the
relevant error message doesn't show up in the TH view), or recall bugs
from memory, which is hard for other people to do.

Another potential solution is to block developers until they fix their
tests. This is actually not as crazy as it sounds. After all, when
landing code, reviewers can (and should) block developers from landing
until they have tests that go with the code. So why is it that the bar
is lowered later, and the test is allowed to be disabled without
backing out or disabling the corresponding code? (Rhetorical
question.) It's not easy to back out or disable code, but it's much
easier to prevent a developer from landing new code if they have a
backlog of intermittents or disabled tests. But again, this is also a
bad solution for various reasons.

I've been around long enough to realize that there's no really good
solution here. And that makes me sad. But I also want to really thank
Joel, the sheriffs, and anybody else who has to deal with the pointy
end of this problem for doing the great job that they do, given the
circumstances. And I'd like to encourage developers to try to take
more responsibility for intermittent failures, even in the absence of
adequate tooling and STR and all that other good stuff. At the very
least, respond to your needinfos.

Cheers,
kats


On Wed, Mar 8, 2017 at 5:18 AM,  <jma...@mozilla.com> wrote:
> On Tuesday, March 7, 2017 at 11:45:38 PM UTC-5, Chris Pearce wrote:
>> I recommend that instead of classifying intermittents as tests which
>> fail > 30 times per week, we classify tests that fail more than some
>> threshold percentage as intermittent. Otherwise, on a week with lots of
>> checkins, a test which isn't actually a problem could clear the threshold
>> and cause unnecessary work for orange triage people and developers alike.
>>
>> The currently published threshold is 8%:
>>
>> https://wiki.mozilla.org/Sheriffing/Test_Disabling_Policy#Identifying_problematic_tests
>>
>> 8% seems reasonable to me.
>>
>> Also, whenever a test is disabled, not only should a bug be filed, but 
>> please _please_ need-info the test owner or at least someone on the affected 
>> team.
>>
>> If a test for a feature is disabled without the maintainer of that feature 
>> knowing, then we are flying blind and we are putting the quality of our 
>> product at risk.
>>
>>
>> cpearce.
>>
>
> Thanks cpearce for the concern here.  Regarding disabling tests, all tests
> that we have disabled as part of the stockwell project have started out with
> a triage where we ni the responsible party and the bug is filed in the
> component the test is associated with.  I assume that if the bug is filed in
> the right component, others on the team will be made aware of it.  Right
> now I assume the triage owner of a component is the owner of its tests and
> can proxy the request to the correct person on the team (many times the
> original author is on PTO, busy with a project, has left the team, etc.).
> Please let me know if this is a false assumption and what we could do to
> better get bugs in front of the right people.
>
> I agree 8% is a good number; the sheriff policy has other criteria as well
> (top 20 on Orange Factor, 100 times/month).  We picked 30 times/week because
> that is where bugs start becoming frequent enough to easily reproduce
> (locally or on try) and where it is reasonable to expect a fix.  There is
> ambiguity when using a percentage: on a low-volume week (as most of December
> was) we see <500 pushes/week, and the percentage doesn't indicate how many
> times the test was actually run, which is affected by SETA (which reduces
> the tests run on 4 out of 5 commits to save on load) and by people doing
> retriggers/backfills.  If the test was at 8% last week and is at 7% this
> week, do we ignore it?
>
> Picking a single number like 30 times per 7 days removes that ambiguity and
> ensures that we can stay focused and don't have to worry about
> recalculations.  It is true that on lower-volume weeks 30 times per 7 days
> doesn't happen as frequently, yet we have always had many bugs to work on
> with that threshold.
_______________________________________________
dev-platform mailing list
dev-platform@lists.mozilla.org
https://lists.mozilla.org/listinfo/dev-platform
