+1, and the same should go for @Ignore annotations as well.
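
For anyone skimming the thread, this is roughly the kind of thing we would be deleting. A minimal sketch, not code from the repo: the test names and ticket numbers are made up, and the FlakyTest category's package is quoted from memory, so treat it as an assumption.

    import org.junit.Ignore;
    import org.junit.Test;
    import org.junit.experimental.categories.Category;
    import org.apache.geode.test.junit.categories.FlakyTest; // assumed package

    public class SomeRegionDUnitTest {

      // Tagged flaky at some point and never revisited -- the tag (and the
      // excuse it provides) is what we'd remove.
      @Category(FlakyTest.class)
      @Test
      public void updatesReachAllMembers() {
        // ...
      }

      // Same idea for @Ignore: a disabled test with a ticket reference is a
      // deferred failure, not a fixed one.
      @Ignore("GEODE-NNNN: disabled until the underlying race is fixed")
      @Test
      public void entriesExpireAfterIdleTimeout() {
        // ...
      }
    }
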
> On Jul 6, 2018, at 11:10 AM, Alexander Murmann <amurm...@pivotal.io> wrote:
>
> +1 for fixing immediately.
>
> Since Dan is already trying to shake out more brittleness, this seems to be the right time to get rid of the flaky label. Let's just treat all tests the same and fix them.
>
>> On Fri, Jul 6, 2018 at 9:31 AM, Kirk Lund <kl...@apache.org> wrote:
>>
>> I should add that I'm only in favor of deleting the category if we have a new policy that any failure means we have to fix the test and/or product code. Even if you think the failure is in a test that you or your team is not responsible for, that's no excuse to ignore a failure in your private precheckin.
>>
>>> On Fri, Jul 6, 2018 at 9:29 AM, Dale Emery <dem...@pivotal.io> wrote:
>>>
>>> The pattern I’ve seen in lots of other organizations: when a few tests intermittently give different answers, people attribute the intermittence to the tests, quickly lose trust in the entire suite, and increasingly discount failures.
>>>
>>> If we’re going to attend to every failure in the larger suite, then we won’t suffer that fate, and I’m in favor of deleting the Flaky tag.
>>>
>>> Dale
>>>
>>>> On Jul 5, 2018, at 8:15 PM, Dan Smith <dsm...@pivotal.io> wrote:
>>>>
>>>> Honestly, I've never liked the flaky category. What it means is that at some point in the past, we decided to put off tracking down and fixing a failure, and now we're left with a bug number and a description and that's it.
>>>>
>>>> I think we will be better off if we just get rid of the flaky category entirely. That way no one can label anything else as flaky and push it off for later, and if flaky tests fail again we will actually prioritize and fix them instead of ignoring them.
>>>>
>>>> I think Patrick was looking at rerunning the flaky tests to see what is still failing. How about we just run the whole flaky suite some number of times (100?), fix whatever is still failing, and close out and remove the category from the rest?
>>>>
>>>> I think we will get more benefit from shaking out and fixing the issues we have in the current codebase than we will from carefully explaining the flaky failures from the past.
>>>>
>>>> -Dan
>>>>
>>>>> On Thu, Jul 5, 2018 at 7:03 PM, Dale Emery <dem...@pivotal.io> wrote:
>>>>>
>>>>> Hi Alexander and all,
>>>>>
>>>>>> On Jul 5, 2018, at 5:11 PM, Alexander Murmann <amurm...@pivotal.io> wrote:
>>>>>>
>>>>>> Hi everyone!
>>>>>>
>>>>>> Dan Smith started a discussion about shaking out more flaky DUnit tests. That's a great effort and I am happy it's happening.
>>>>>>
>>>>>> As a corollary to that conversation, I wonder what the criteria should be for a test to no longer be considered flaky and have the category removed.
>>>>>>
>>>>>> In general the bar should be fairly high. Even if a test only fails ~1 in 500 runs, that's still a problem given how many tests we have.
>>>>>>
>>>>>> I see two ends of the spectrum:
>>>>>> 1. We have a good understanding of why the test was flaky and think we fixed it.
>>>>>> 2. We have a hard time reproducing the flaky behavior and have no good theory as to why the test might have shown flaky behavior.
>>>>>>
>>>>>> In the first case I'd suggest running the test ~100 times to get a little more confidence that we fixed the flaky behavior, and then removing the category.
>>>>>
>>>>> Here’s a test for case 1:
>>>>>
>>>>> If we really understand why it was flaky, we will be able to:
>>>>> - Identify the “faults”—the broken places in the code (whether system code or test code).
>>>>> - Identify the exact conditions under which those faults led to the failures we observed.
>>>>> - Explain how those faults, under those conditions, led to those failures.
>>>>> - Run unit tests that exercise the code under those same conditions, and demonstrate that the formerly broken code now does the right thing.
>>>>>
>>>>> If we’re lacking any of these things, I’d say we’re dealing with case 2.
>>>>>
>>>>>> The second case is a lot more problematic. How often do we want to run a test like that before we decide that it might have been fixed since we last saw it happen? Anything else we could/should do to verify the test deserves our trust again?
>>>>>
>>>>> I would want a clear, compelling explanation of the failures we observed.
>>>>>
>>>>> Clear and compelling are subjective, of course. For me, clear and compelling would include descriptions of:
>>>>> - The faults in the code: what, specifically, was broken.
>>>>> - The specific conditions under which the code did the wrong thing.
>>>>> - How those faults, under those conditions, led to those failures.
>>>>> - How the fix either prevents those conditions or causes the formerly broken code to now do the right thing.
>>>>>
>>>>> Even if we don’t have all of these elements, we may have some of them. That can help us calibrate our confidence. But the elements work together. If we’re lacking one, the others are shaky, to some extent.
>>>>>
>>>>> The more elements are missing in our explanation, the more times I’d want to run the test before trusting it.
>>>>>
>>>>> Cheers,
>>>>> Dale
>>>>>
>>>>> —
>>>>> Dale Emery
>>>>> dem...@pivotal.io
>>>
>>> —
>>> Dale Emery
>>> dem...@pivotal.io
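
As a footnote to Dan's "run the whole flaky suite ~100 times" suggestion, here is a minimal sketch of what that loop could look like as a throwaway JUnit 4 driver. The test class is the hypothetical one from the sketch earlier in the thread; in practice you would point it at whatever classes the Flaky category currently covers, or drive the same loop from CI.

    import org.junit.runner.JUnitCore;
    import org.junit.runner.Result;

    public class RepeatFlakyTests {
      public static void main(String[] args) {
        int runs = 100;
        int failedRuns = 0;
        for (int i = 1; i <= runs; i++) {
          // SomeRegionDUnitTest is the hypothetical test class from the earlier sketch.
          Result result = JUnitCore.runClasses(SomeRegionDUnitTest.class);
          if (!result.wasSuccessful()) {
            failedRuns++;
            System.out.println("run " + i + ": " + result.getFailureCount() + " failure(s)");
          }
        }
        System.out.println(failedRuns + " of " + runs + " runs had at least one failure");
      }
    }
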