> > we noticed CI going from a steady 3-ish failures to many and it's getting fixed. So we're moving in the right direction imo.

An observation about this: there's tooling and technology widely in use to help prevent ever getting into this state (to Benedict's point: blocking merge on CI failure, or nightly tests and reverting regression commits, etc.). I think there are significant time and energy savings for us in using automation to be proactive about the quality of our test boards rather than reactive.
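To make that concrete, here is a minimal sketch of the kind of nightly automation meant here, assuming JUnit-style XML reports and a hypothetical known-flaky baseline file (the paths, file names, and report layout are illustrative assumptions, not an existing Cassandra CI script):

    #!/usr/bin/env python3
    # Hypothetical sketch: compare a nightly JUnit XML report against a
    # known-flaky baseline and exit non-zero when new failures appear, so the
    # job can block merge or flag the commit range for bisect/revert.
    import sys
    import xml.etree.ElementTree as ET

    BASELINE = "known_flaky_tests.txt"            # hypothetical: one "Class#method" per line
    REPORT = "build/test/TESTS-TestSuites.xml"    # hypothetical aggregated JUnit report

    def failed_tests(report_path):
        """Collect fully qualified names of failed or errored test cases."""
        failed = set()
        root = ET.parse(report_path).getroot()
        for case in root.iter("testcase"):
            if case.find("failure") is not None or case.find("error") is not None:
                failed.add(f'{case.get("classname")}#{case.get("name")}')
        return failed

    def main():
        with open(BASELINE) as f:
            known_flaky = {line.strip() for line in f if line.strip()}
        new_failures = failed_tests(REPORT) - known_flaky
        if new_failures:
            print("New test failures (not in the flaky baseline):")
            for name in sorted(new_failures):
                print("  " + name)
            return 1    # non-zero exit lets CI block the merge or trigger a revert
        print("No new failures beyond the known-flaky baseline.")
        return 0

    if __name__ == "__main__":
        sys.exit(main())

A nightly job that exits non-zero on new failures like this could gate merges, or at least narrow down the commit range to bisect and revert.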
I 100% agree that it's heartening to see that the quality of the codebase is improving, as is the discipline / attentiveness of our collective culture. That said, I believe we still have a pretty fragile system when it comes to test failure accumulation.

On Thu, Nov 4, 2021 at 2:46 AM Berenguer Blasi <berenguerbl...@gmail.com> wrote:

> I agree with David. CI has been pretty reliable besides the random jenkins going down or timeouts. The same 3 or 4 tests were the only flaky ones in jenkins and Circle was very green. I bisected a couple of failures to legit code errors, David is fixing some more, others have as well, etc.
>
> It is good news imo, as we're just getting to learn that our CI post 4.0 is reliable and we need to start treating it as such and paying attention to its reports. Not perfect, but reliable enough that it would have prevented those bugs getting merged.
>
> In fact we're having this conversation because we noticed CI going from a steady 3-ish failures to many and it's getting fixed. So we're moving in the right direction imo.
>
> On 3/11/21 19:25, David Capwell wrote:
> >> It's hard to gate commit on a clean CI run when there's flaky tests
> >
> > I agree; this is also why so much effort went into the 4.0 release to remove as many as possible. Just over 1 month ago we were not really having a flaky test issue (outside of the sporadic timeout issues; my CircleCI runs were green constantly), and now the "flaky tests" I see are all actual bugs (I've been root causing 2 of the 3 I reported) and some (not all) of the flakiness was triggered by recent changes in the past month.
> >
> > Right now people do not believe the failing test is caused by their patch and attribute it to flakiness, which then causes the builds to start being flaky, which then leads to a different author coming to fix the issue; this behavior is what I would love to see go away. If we find a flaky test, we should do the following:
> >
> > 1) Has it already been reported, and who is working to fix it? Can we block this patch on the test being fixed? Flaky tests due to timing issues are normally resolved very quickly; real bugs take longer.
> > 2) If not reported, why? If you are the first to see the issue, there's a good chance the patch caused it, so you should root cause it. If you are not the first to see it, why did others not report it (we tend to be good about this, even to the point that Brandon has to mark the new tickets as dups…)?
> >
> > I have committed when there was flakiness, and I have caused flakiness; I'm not saying I am perfect or that I do the above, just that if we all moved to the above model we could start relying on CI. The biggest impact to our stability is people actually root causing flaky tests.
> >
> >> I think we're going to need a system that
> >> understands the difference between success, failure, and timeouts
> >
> > I am curious how this system can know that the timeout is not an actual failure. There was a bug in 4.0 with time serialization in messaging, which would cause the message to get dropped; this presented itself as a timeout if I remember properly (Jon Meredith or Yifan Cai fixed this bug, I believe).
> >
> >> On Nov 3, 2021, at 10:56 AM, Brandon Williams <dri...@gmail.com> wrote:
> >>
> >> On Wed, Nov 3, 2021 at 12:35 PM bened...@apache.org <bened...@apache.org> wrote:
> >>> The largest number of test failures turn out (as pointed out by David) to be due to how arcane it was to trigger the full test suite.
> >>> Hopefully we can get on top of that, but I think a significant remaining issue is a lack of trust in the output of CI. It's hard to gate commit on a clean CI run when there's flaky tests, and it doesn't take much to misattribute one failing test to the existing flakiness (I tend to compare to a run of the trunk baseline, but this is burdensome and still error prone). The more flaky tests there are, the more likely this is.
> >>>
> >>> This is in my opinion the real cost of flaky tests, and it's probably worth trying to crack down on them hard if we can. It's possible the Simulator may help here, when I finally finish it up, as we can port flaky tests to run with the Simulator and the failing seed can then be explored deterministically (all being well).
> >>
> >> I totally agree that the lack of trust is a driving problem here, even in knowing which CI system to rely on. When Jenkins broke but Circle was fine, we all assumed it was a problem with Jenkins, right up until Circle also broke.
> >>
> >> In testing a distributed system like this I think we're always going to have failures, even on non-flaky tests, simply because the underlying infrastructure is variable, with transient failures of its own (the network is reliable!). We can fix the flakies where the fault is in the code (and we've done this to many already), but to get more trustworthy output, I think we're going to need a system that understands the difference between success, failure, and timeouts, and in the latter case knows how to at least mark them differently. Simulator may help, as do the in-jvm dtests, but there is ultimately no way to cover everything without doing some things the hard, more realistic way where sometimes shit happens, marring the almost-perfect runs with noisy doubt, which then has to be sifted through to determine if there was a real issue.
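As an illustration of the success / failure / timeout distinction Brandon describes, here is a minimal sketch that buckets JUnit XML results and marks timeouts separately. The report location and the timeout heuristic are assumptions rather than an existing tool, and, per David's caveat above, a heuristic like this cannot tell a genuine bug that merely presents as a timeout from an infrastructure hiccup:

    #!/usr/bin/env python3
    # Hypothetical sketch: walk a directory of JUnit XML reports and bucket each
    # test case as pass, failure, or timeout, so timeouts can be reported
    # separately instead of being lumped in with plain failures.
    import glob
    import xml.etree.ElementTree as ET

    REPORT_GLOB = "build/test/output/TEST-*.xml"    # hypothetical report location
    TIMEOUT_HINTS = ("TimeoutException", "timed out", "Timeout occurred")

    def classify(case):
        """Return 'pass', 'failure', or 'timeout' for a <testcase> element."""
        problem = case.find("failure")
        if problem is None:
            problem = case.find("error")
        if problem is None:
            return "pass"
        text = (problem.get("message") or "") + (problem.text or "")
        return "timeout" if any(hint in text for hint in TIMEOUT_HINTS) else "failure"

    def main():
        buckets = {"pass": 0, "failure": 0, "timeout": 0}
        for path in glob.glob(REPORT_GLOB):
            for case in ET.parse(path).getroot().iter("testcase"):
                buckets[classify(case)] += 1
        # Reporting timeouts separately lets a noisy-infrastructure run be
        # distinguished from a run with apparent genuine regressions.
        print(f"pass={buckets['pass']} failure={buckets['failure']} timeout={buckets['timeout']}")

    if __name__ == "__main__":
        main()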