> > we noticed CI going from a steady 3-ish failures to many and it's getting fixed. So we're moving in the right direction imo.

An observation about this: there's tooling and technology widely in use to help prevent ever getting into this state (to Benedict's point: blocking merge on CI failure, or nightly tests and reverting regression commits, etc.). I think there are significant time and energy savings for us in using automation to be proactive about the quality of our test boards rather than reactive.
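To make that concrete, here is a minimal sketch of the kind of nightly automation meant here, assuming JUnit-style XML reports and a hypothetical known-flaky baseline file (the paths, file names, and report layout are illustrative assumptions, not an existing Cassandra CI script):

    #!/usr/bin/env python3
    # Hypothetical sketch: compare a nightly JUnit XML report against a
    # known-flaky baseline and exit non-zero when new failures appear, so the
    # job can block merge or flag the commit range for bisect/revert.
    import sys
    import xml.etree.ElementTree as ET

    BASELINE = "known_flaky_tests.txt"            # hypothetical: one "Class#method" per line
    REPORT = "build/test/TESTS-TestSuites.xml"    # hypothetical aggregated JUnit report

    def failed_tests(report_path):
        """Collect fully qualified names of failed or errored test cases."""
        failed = set()
        root = ET.parse(report_path).getroot()
        for case in root.iter("testcase"):
            if case.find("failure") is not None or case.find("error") is not None:
                failed.add(f'{case.get("classname")}#{case.get("name")}')
        return failed

    def main():
        with open(BASELINE) as f:
            known_flaky = {line.strip() for line in f if line.strip()}
        new_failures = failed_tests(REPORT) - known_flaky
        if new_failures:
            print("New test failures (not in the flaky baseline):")
            for name in sorted(new_failures):
                print("  " + name)
            return 1    # non-zero exit lets CI block the merge or trigger a revert
        print("No new failures beyond the known-flaky baseline.")
        return 0

    if __name__ == "__main__":
        sys.exit(main())

A nightly job that exits non-zero on new failures like this could gate merges, or at least narrow down the commit range to bisect and revert.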
I 100% agree that it's heartening to see that the quality of the codebase is improving, as is the discipline / attentiveness of our collective culture. That said, I believe we still have a pretty fragile system when it comes to test failure accumulation.

On Thu, Nov 4, 2021 at 2:46 AM Berenguer Blasi <berenguerbl...@gmail.com> wrote:

> I agree with David. CI has been pretty reliable besides the random jenkins going down or timeouts. The same 3 or 4 tests were the only flaky ones in jenkins and Circle was very green. I bisected a couple of failures to legit code errors, David is fixing some more, others have as well, etc.
>
> It is good news imo, as we're just getting to learn that our CI post 4.0 is reliable and we need to start treating it as such and paying attention to its reports. Not perfect, but reliable enough that it would have prevented those bugs getting merged.
>
> In fact we're having this conversation because we noticed CI going from a steady 3-ish failures to many and it's getting fixed. So we're moving in the right direction imo.
>
> On 3/11/21 19:25, David Capwell wrote:
> >> It's hard to gate commit on a clean CI run when there's flaky tests
> >
> > I agree; this is also why so much effort went into the 4.0 release to remove as many as possible. Just over 1 month ago we were not really having a flaky test issue (outside of the sporadic timeout issues; my CircleCI runs were green constantly), and now the "flaky tests" I see are all actual bugs (I've been root causing 2 of the 3 I reported) and some (not all) of the flakiness was triggered by recent changes in the past month.
> >
> > Right now people do not believe the failing test is caused by their patch and attribute it to flakiness, which then causes the builds to start being flaky, which then leads to a different author coming to fix the issue; this behavior is what I would love to see go away. If we find a flaky test, we should do the following:
> >
> > 1) Has it already been reported, and who is working to fix it? Can we block this patch on the test being fixed? Flaky tests due to timing issues are normally resolved very quickly; real bugs take longer.
> > 2) If not reported, why? If you are the first to see the issue, there's a good chance the patch caused it, so you should root cause it. If you are not the first to see it, why did others not report it (we tend to be good about this, even to the point that Brandon has to mark the new tickets as dups…)?
> >
> > I have committed when there was flakiness, and I have caused flakiness; I'm not saying I am perfect or that I do the above, just that if we all moved to the above model we could start relying on CI. The biggest impact to our stability is people actually root causing flaky tests.
> >
> >> I think we're going to need a system that
> >> understands the difference between success, failure, and timeouts
> >
> > I am curious how this system can know that the timeout is not an actual failure. There was a bug in 4.0 with time serialization in messaging, which would cause the message to get dropped; this presented itself as a timeout if I remember properly (Jon Meredith or Yifan Cai fixed this bug, I believe).
> >
> >> On Nov 3, 2021, at 10:56 AM, Brandon Williams <dri...@gmail.com> wrote:
> >>
> >> On Wed, Nov 3, 2021 at 12:35 PM bened...@apache.org <bened...@apache.org> wrote:
> >>> The largest number of test failures turn out (as pointed out by David) to be due to how arcane it was to trigger the full test suite.
> >>> Hopefully we can get on top of that, but I think a significant remaining issue is a lack of trust in the output of CI. It's hard to gate commit on a clean CI run when there's flaky tests, and it doesn't take much to misattribute one failing test to the existing flakiness (I tend to compare to a run of the trunk baseline, but this is burdensome and still error prone). The more flaky tests there are, the more likely this is.
> >>>
> >>> This is in my opinion the real cost of flaky tests, and it's probably worth trying to crack down on them hard if we can. It's possible the Simulator may help here, when I finally finish it up, as we can port flaky tests to run with the Simulator and the failing seed can then be explored deterministically (all being well).
> >>
> >> I totally agree that the lack of trust is a driving problem here, even in knowing which CI system to rely on. When Jenkins broke but Circle was fine, we all assumed it was a problem with Jenkins, right up until Circle also broke.
> >>
> >> In testing a distributed system like this I think we're always going to have failures, even on non-flaky tests, simply because the underlying infrastructure is variable, with transient failures of its own (the network is reliable!). We can fix the flakies where the fault is in the code (and we've done this to many already), but to get more trustworthy output, I think we're going to need a system that understands the difference between success, failure, and timeouts, and in the latter case knows how to at least mark them differently. Simulator may help, as do the in-jvm dtests, but there is ultimately no way to cover everything without doing some things the hard, more realistic way where sometimes shit happens, marring the almost-perfect runs with noisy doubt, which then has to be sifted through to determine if there was a real issue.
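As an illustration of the success / failure / timeout distinction Brandon describes, here is a minimal sketch that buckets JUnit XML results and marks timeouts separately. The report location and the timeout heuristic are assumptions rather than an existing tool, and, per David's caveat above, a heuristic like this cannot tell a genuine bug that merely presents as a timeout from an infrastructure hiccup:

    #!/usr/bin/env python3
    # Hypothetical sketch: walk a directory of JUnit XML reports and bucket each
    # test case as pass, failure, or timeout, so timeouts can be reported
    # separately instead of being lumped in with plain failures.
    import glob
    import xml.etree.ElementTree as ET

    REPORT_GLOB = "build/test/output/TEST-*.xml"    # hypothetical report location
    TIMEOUT_HINTS = ("TimeoutException", "timed out", "Timeout occurred")

    def classify(case):
        """Return 'pass', 'failure', or 'timeout' for a <testcase> element."""
        problem = case.find("failure")
        if problem is None:
            problem = case.find("error")
        if problem is None:
            return "pass"
        text = (problem.get("message") or "") + (problem.text or "")
        return "timeout" if any(hint in text for hint in TIMEOUT_HINTS) else "failure"

    def main():
        buckets = {"pass": 0, "failure": 0, "timeout": 0}
        for path in glob.glob(REPORT_GLOB):
            for case in ET.parse(path).getroot().iter("testcase"):
                buckets[classify(case)] += 1
        # Reporting timeouts separately lets a noisy-infrastructure run be
        # distinguished from a run with apparent genuine regressions.
        print(f"pass={buckets['pass']} failure={buckets['failure']} timeout={buckets['timeout']}")

    if __name__ == "__main__":
        main()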