FYI, most of the intermittent failures I've reviewed and fixed were caused
by a Java GC pause landing in a small window where the test assumes
something asynchronous has already completed, without using Awaitility to
wait for it. Those tests remain inherently flaky until they are changed.
Unless you can spot the async race just by reading the test, some sort of
CPU load tool would be needed to expose it. Running these tests thousands
of times to prove they won't flake in the future doesn't work very well;
it only catches asynchronous races that are already close to failing.
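
As an illustration only (the class, method, and helper names below are made
up, and it assumes Awaitility 3.x+ and AssertJ are on the test classpath),
the typical fix is to poll with Awaitility instead of sampling the
asynchronous state exactly once:

    import static org.assertj.core.api.Assertions.assertThat;
    import static org.awaitility.Awaitility.await;

    import java.util.concurrent.TimeUnit;
    import org.junit.Test;

    public class AsyncCompletionExampleTest {

      @Test
      public void valueEventuallyVisible() {
        startAsyncPut("key", "value"); // stand-in for the async product call

        // Instead of asserting immediately (or after a Thread.sleep) that
        // the async work finished -- which flakes whenever a GC pause lands
        // in that window -- keep retrying the assertion until it passes or
        // a generous timeout expires.
        await().atMost(60, TimeUnit.SECONDS)
            .untilAsserted(() -> assertThat(readValue("key")).isEqualTo("value"));
      }

      // Placeholder stand-ins for whatever asynchronous code the real test
      // exercises.
      private void startAsyncPut(String key, String value) { /* no-op */ }

      private String readValue(String key) { return "value"; }
    }

With this shape, a multi-second GC pause just becomes one slow poll
iteration rather than a failed assertion.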

On Tue, Jul 10, 2018 at 9:42 AM, Patrick Rhomberg <prhomb...@pivotal.io>
wrote:

> Belatedly...
>
>   During our investigation into the application of the FlakyTest category,
> we ran the FlakyTest precheckin target approximately 1,500 times over this
> past weekend.  The vast majority of these tests passed without failure.  In
> my opinion, this provides a baseline of confidence in these tests, and they
> should have their FlakyTest categorization removed.
>
>   If some hard-to-reach race condition is still present and causes the
> pipeline to go red in the future, effort should then be expended to fully
> analyze and fix/refactor the test at that time.
>
>   Those tests that did fail will remain categorized Flaky, with an
> immediate emphasis on analysis and refactoring to correct the test, restore
> confidence, and ultimately remove the FlakyTest category.
>
> A PR for the above should appear before the end of the day today.  So
> huzzah for forward progress!
>
> On Mon, Jul 9, 2018 at 7:19 AM, Jacob Barrett <jbarr...@pivotal.io> wrote:
>
> > +1 and the same should go for @Ignore annotations as well.
> >
> > > On Jul 6, 2018, at 11:10 AM, Alexander Murmann <amurm...@pivotal.io> wrote:
> > >
> > > +1 for fixing immediately.
> > >
> > > Since Dan is already trying to shake out more brittleness, this seems to
> > > be the right time to get rid of the flaky label. Let's just treat all
> > > tests the same and fix them.
> > >
> > >> On Fri, Jul 6, 2018 at 9:31 AM, Kirk Lund <kl...@apache.org> wrote:
> > >>
> > >> I should add that I'm only in favor of deleting the category if we have
> > >> a new policy that any failure means we have to fix the test and/or
> > >> product code. Even if you think the failure is in a test that you or
> > >> your team is not responsible for, that's no excuse to ignore a failure
> > >> in your private precheckin.
> > >>
> > >>> On Fri, Jul 6, 2018 at 9:29 AM, Dale Emery <dem...@pivotal.io> wrote:
> > >>>
> > >>> The pattern I’ve seen in lots of other organizations: When a few tests
> > >>> intermittently give different answers, people attribute the
> > >>> intermittence to the tests, quickly lose trust in the entire suite, and
> > >>> increasingly discount failures.
> > >>>
> > >>> If we’re going to attend to every failure in the larger suite, then we
> > >>> won’t suffer that fate, and I’m in favor of deleting the Flaky tag.
> > >>>
> > >>> Dale
> > >>>
> > >>>> On Jul 5, 2018, at 8:15 PM, Dan Smith <dsm...@pivotal.io> wrote:
> > >>>>
> > >>>> Honestly I've never liked the flaky category. What it means is that at
> > >>>> some point in the past, we decided to put off tracking down and fixing
> > >>>> a failure and now we're left with a bug number and a description and
> > >>>> that's it.
> > >>>>
> > >>>> I think we will be better off if we just get rid of the flaky category
> > >>>> entirely. That way no one can label anything else as flaky and push it
> > >>>> off for later, and if flaky tests fail again we will actually
> > >>>> prioritize and fix them instead of ignoring them.
> > >>>>
> > >>>> I think Patrick was looking at rerunning the flaky tests to see what
> > >>>> is still failing. How about we just run the whole flaky suite some
> > >>>> number of times (100?), fix whatever is still failing and close out
> > >>>> and remove the category from the rest?
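
(For anyone unfamiliar with how the label is applied: it's a JUnit category
annotation, so "removing the category" is a one-line change per test. A
rough sketch below, with the class name invented and the category package
paths written from memory rather than checked against the current tree:)

    import org.junit.Test;
    import org.junit.experimental.categories.Category;
    import org.apache.geode.test.junit.categories.DistributedTest;
    // The FlakyTest import goes away along with the annotation entry below.

    // Before: @Category({DistributedTest.class, FlakyTest.class})
    @Category(DistributedTest.class)
    public class SomeRegionDUnitTest {

      @Test
      public void putIsReplicated() {
        // test body unchanged; only the categorization is dropped
      }
    }
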
> > >>>>
> > >>>> I think we will get more benefit from shaking out and fixing the
> > >>>> issues we have in the current codebase than we will from carefully
> > >>>> explaining the flaky failures from the past.
> > >>>>
> > >>>> -Dan
> > >>>>
> > >>>>> On Thu, Jul 5, 2018 at 7:03 PM, Dale Emery <dem...@pivotal.io> wrote:
> > >>>>>
> > >>>>> Hi Alexander and all,
> > >>>>>
> > >>>>>> On Jul 5, 2018, at 5:11 PM, Alexander Murmann <amurm...@pivotal.io> wrote:
> > >>>>>>
> > >>>>>> Hi everyone!
> > >>>>>>
> > >>>>>> Dan Smith started a discussion about shaking out more flaky DUnit
> > >>>>>> tests. That's a great effort and I am happy it's happening.
> > >>>>>>
> > >>>>>> As a corollary to that conversation, I wonder what the criteria
> > >>>>>> should be for a test to no longer be considered flaky and have the
> > >>>>>> category removed.
> > >>>>>>
> > >>>>>> In general the bar should be fairly high. Even if a test only fails
> > >>>>>> ~1 in 500 runs, that's still a problem given how many tests we have.
> > >>>>>>
> > >>>>>> I see two ends of the spectrum:
> > >>>>>> 1. We have a good understanding of why the test was flaky and think
> > >>>>>> we fixed it.
> > >>>>>> 2. We have a hard time reproducing the flaky behavior and have no
> > >>>>>> good theory as to why the test might have shown flaky behavior.
> > >>>>>>
> > >>>>>> In the first case I'd suggest running the test ~100 times to get a
> > >>>>>> little more confidence that we fixed the flaky behavior and then
> > >>>>>> remove the category.
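
(One lightweight way to do that ~100-run check locally, if you don't want
to script the runner: a small JUnit 4 rule that repeats each test body. The
sketch below is illustrative, not something lifted from our tree; the class
and test names are invented.)

    import org.junit.Rule;
    import org.junit.Test;
    import org.junit.rules.TestRule;
    import org.junit.runner.Description;
    import org.junit.runners.model.Statement;

    public class RepeatedRunExampleTest {

      /** Runs each test body the given number of times; the first failure
          fails the test. */
      static class RepeatRule implements TestRule {
        private final int times;

        RepeatRule(int times) {
          this.times = times;
        }

        @Override
        public Statement apply(Statement base, Description description) {
          return new Statement() {
            @Override
            public void evaluate() throws Throwable {
              for (int i = 0; i < times; i++) {
                base.evaluate();
              }
            }
          };
        }
      }

      @Rule
      public RepeatRule repeat = new RepeatRule(100);

      @Test
      public void formerlyFlakyScenario() {
        // body of the previously flaky test goes here
      }
    }
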
> > >>>>>
> > >>>>> Here’s a test for case 1:
> > >>>>>
> > >>>>> If we really understand why it was flaky, we will be able to:
> > >>>>>   - Identify the “faults”: the broken places in the code (whether
> > >>>>>     system code or test code).
> > >>>>>   - Identify the exact conditions under which those faults led to the
> > >>>>>     failures we observed.
> > >>>>>   - Explain how those faults, under those conditions, led to those
> > >>>>>     failures.
> > >>>>>   - Run unit tests that exercise the code under those same
> > >>>>>     conditions, and demonstrate that the formerly broken code now
> > >>>>>     does the right thing.
> > >>>>>
> > >>>>> If we’re lacking any of these things, I’d say we’re dealing with
> > >>>>> case 2.
> > >>>>>
> > >>>>>> The second case is a lot more problematic. How often do we want to
> > >>>>>> run a test like that before we decide that it might have been fixed
> > >>>>>> since we last saw it happen? Anything else we could/should do to
> > >>>>>> verify the test deserves our trust again?
> > >>>>>
> > >>>>>
> > >>>>> I would want a clear, compelling explanation of the failures we
> > >>>>> observed.
> > >>>>>
> > >>>>> Clear and compelling are subjective, of course. For me, clear and
> > >>>>> compelling would include descriptions of:
> > >>>>>  - The faults in the code. What, specifically, was broken.
> > >>>>>  - The specific conditions under which the code did the wrong thing.
> > >>>>>  - How those faults, under those conditions, led to those failures.
> > >>>>>  - How the fix either prevents those conditions, or causes the
> > >>>>>    formerly broken code to now do the right thing.
> > >>>>>
> > >>>>> Even if we don’t have all of these elements, we may have some of
> > >>>>> them. That can help us calibrate our confidence. But the elements
> > >>>>> work together. If we’re lacking one, the others are shaky, to some
> > >>>>> extent.
> > >>>>>
> > >>>>> The more elements are missing in our explanation, the more times I’d
> > >>>>> want to run the test before trusting it.
> > >>>>>
> > >>>>> Cheers,
> > >>>>> Dale
> > >>>>>
> > >>>>> —
> > >>>>> Dale Emery
> > >>>>> dem...@pivotal.io
> > >>>>>
> > >>>>>
> > >>>
> > >>> —
> > >>> Dale Emery
> > >>> dem...@pivotal.io
> > >>>
> > >>>
> > >>>
> > >>>
> > >>>
> > >>
> >
>
