>
> > How would we handle commits that break the integration tests? Would
> > we revert commits on trunk, or fix-forward?
>
> This is currently up to committer discretion, and I don't think that
> would change if we were to re-tool the PR builds. In the presence of
> flaky failures, we can't reliably blame failures on particular commits
> without running much more expensive statistical tests.
>

I was thinking more about integration regressions than flaky failures.
However, if we're running a module's tests as well as the "affected
modules" tests, then we should be running the correct integration tests
for each PR build. I guess this gets back to Chris's point about how this
approach favors the downstream modules; maybe that's unavoidable.
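
Just to make sure I'm picturing the same thing, here is a rough sketch (in
Python, purely for illustration) of what "affected modules" selection could
mean, with transitive downstream dependents included. The module graph
below is made up for the example and is not our actual Gradle layout:

from collections import deque

# Hypothetical module -> downstream dependents map (illustrative only).
DEPENDENTS = {
    "clients": ["core", "streams", "connect"],
    "core": ["streams", "connect", "mirror"],
    "streams": [],
    "connect": ["mirror"],
    "mirror": [],
}

def affected_modules(changed):
    """Return the changed modules plus all transitive downstream dependents."""
    affected, queue = set(changed), deque(changed)
    while queue:
        for dependent in DEPENDENTS.get(queue.popleft(), []):
            if dependent not in affected:
                affected.add(dependent)
                queue.append(dependent)
    return sorted(affected)

# A clients-only diff would still schedule core, streams, connect, and
# mirror tests.
print(affected_modules(["clients"]))

If the PR build runs that whole set, the integration tests downstream of a
change should get exercised before merge, which was my concern above.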

Come to think of it, a similar problem exists with system tests. We don't
run these for each PR (or each trunk commit, or even nightly, AFAIK) since
they are prohibitively costly and lengthy to run. However, they do
sometimes find integration regressions that the JUnit suite missed. In
those cases, fixing forward is our only option.


On Tue, Jun 13, 2023 at 12:19 PM Greg Harris <greg.har...@aiven.io.invalid>
wrote:

> David,
>
> Thanks for finding that gradle plugin. The `changedModules` mode is
> exactly what I had in mind for fairness to modules earlier in the
> dependency graph.
>
> > if we moved to a policy where PRs only need some of the tests to pass
> > to merge, when would we run the full CI? On each trunk commit (i.e., PR
> > merge)?
>
> In a world where the PR run includes only the changed modules and
> their dependencies, the full suite should be run for each commit on
> trunk and on release branches. I don't think that optimizing the trunk
> build runtime is of great benefit, and the current behavior seems
> reasonable to continue.
>
> > How would we handle commits that break the integration tests? Would
> > we revert commits on trunk, or fix-forward?
>
> This is currently up to committer discretion, and I don't think that
> would change if we were to re-tool the PR builds. In the presence of
> flaky failures, we can't reliably blame failures on particular commits
> without running much more expensive statistical tests.
>
> One place where I often see flakiness show up is in new tests, where
> someone has chosen timeouts which work for them locally and in the PR
> build. After some tens or hundreds of runs, the flakiness becomes evident
> and someone looks into a fix-forward.
> I don't necessarily think I would advocate for a hard revert of an
> entire feature if one of the added tests is flaky, but that's my
> discretion. We can adopt a project policy of reverting whatever we
> can, but I don't think that would make the project more welcoming or
> productive than it is now.
>
> Greg
>
> On Tue, Jun 13, 2023 at 7:24 AM David Arthur <mum...@gmail.com> wrote:
> >
> > Hey folks, interesting discussion.
> >
> > I came across a Gradle plugin that calculates a DAG of modules based
> > on the diff and can run only the affected module's tests or the
> > affected + downstream tests.
> >
> > https://github.com/dropbox/AffectedModuleDetector
> >
> > I tested it out locally, and it seems to work as advertised.
> >
> > Greg, if we moved to a policy where PRs only need some of the tests
> > to pass to merge, when would we run the full CI? On each trunk commit
> > (i.e., PR merge)? How would we handle commits that break the
> > integration tests? Would we revert commits on trunk, or fix-forward?
> >
> > -David
> >
> >
> > On Thu, Jun 8, 2023 at 2:02 PM Greg Harris
> > <greg.har...@aiven.io.invalid> wrote:
> >
> > > Gaurav,
> > >
> > > The target-determinator is certainly the "off-the-shelf" solution I
> > > expected would be out there. If the project migrates to Bazel, I think
> > > that would make the partial builds much easier to implement.
> > > I think we should look into the other benefits of migrating to Bazel,
> > > to see whether the migration would be worth it even if the partial
> > > builds feature is decided against, or is later reverted.
> > >
> > > Chris,
> > >
> > > > Do you think we should aim to disable merges without a full suite
> > > > of passing CI runs (allowing for administrative override in an
> > > > emergency)? If so, what would the path be from our current state to
> > > > there? What can we do to ensure that we don't get stuck relying on
> > > > a once-temporary aid that becomes effectively permanent?
> > >
> > > Yes, I think it would be nice to require a green build to merge,
> > > without the heavyweight merge-queue infrastructure that is more
> > > common in high-volume monorepos.
> > >
> > > 1. We would decide on a sunset date for the partial build indicator,
> > > at which point it would be disabled on trunk and all release branches.
> > > I suspect that a sunset date could be set 1-3 releases in the
> > > future.
> > > 2. We would enable partial builds. The merge-button would remain
> > > as-is, with the partial build and full-suite indicators both being
> > > visible.
> > > 3. Communication would be sent out to explain that the partial build
> > > indicators are a temporary flakiness reduction tool.
> > > 4. Committers would continue to rely on the full-suite indicators, and
> > > watch the partial build to get a sense of the level of flakiness in
> > > each module.
> > > 5. On new contributions which have failing partial builds, Committers
> > > can ask that contributors investigate and follow up on flaky failures
> > > that appeared in their PR, explaining that they can help make that
> > > indicator green for future contributions.
> > > 6. After the sunset date passes, the partial build indicators will be
> > > disabled.
> > > 7. If the main trunk build is sufficiently reliable by some criteria
> > > (e.g. 20 passes in a row), we can discuss requiring a green build for
> > > the GitHub merge button.
> > >
> > > > We probably want to build awareness of this dependency graph into
> > > > any partial CI logic we add, but if we do opt for that, then this
> > > > change would disproportionately benefit downstream modules (Streams,
> > > > Connect, MM2), and have little to no benefit for upstream ones
> > > > (clients and at least some core modules).
> > >
> > > I considered that, and I think it is more beneficial to provide
> > > equitable partial builds than ones with perfect coverage. If you want
> > > to know about cross-module failures, I think that the full suite is
> > > still the appropriate tool to detect those.
> > > From another perspective, if this change is only useful to `core`
> > > after all other dependent modules have addressed their flakiness, then
> > > it delays the reward that `core` contributors get for addressing their
> > > own flakiness.
> > > And if anything, making the partial build cover fewer failure modes
> > > and reminding people of that fact will keep the full-suite builds
> > > relevant.
> > >
> > > > but people should already be making
> > > > sure that tests are running locally before they push changes
> > >
> > > I agree, and I don't think that partial builds are in any danger of
> > > replacing local testing for fast synchronous iteration. I also agree
> > > that it is appropriate to give personal feedback to repeat
> > > contributors to improve the quality of their PRs.
> > > CI seems to be for enforcing a lowest common denominator on
> > > contributions, and for benefiting (potentially first-time) contributors
> > > who do not have the familiarity, mindfulness, or resources to
> > > pre-test each of their contributions.
> > > It is in those situations that I think partial builds can be
> > > beneficial: if there is something mechanical that all contributors
> > > should be doing locally, why not have it done automatically on their
> > > behalf?
> > >
> > > > Finally, since there are a number of existing flaky tests on trunk,
> > > > what would the strategy be for handling those? Do we try to get to
> > > > a green state on a per-module basis (possibly with awareness of
> > > > downstream modules) as quickly as possible, and then selectively
> > > > enable partial builds once we feel confident that flakiness has been
> > > > addressed?
> > >
> > > For modules which receive regular PRs, the partial builds would
> > > surface those flakes, creating an incentive to fix them. Once the
> > > per-module flakiness is under control, committers for those modules
> > > can start to require green partial builds, pushing back on PRs which
> > > have obvious failures.
> > > For modules which do not receive regular PRs, those flakes will appear
> > > in all of the full-suite builds, and wouldn't be addressed by the
> > > partial builds. Those would require either another incentive
> > > structure, or volunteers from other modules to help fix them.
> > > I'm not sure which modules receive the least PR traffic, but I can see
> > > that the currently most stale modules are log4j-appender (1 year ago),
> > > streams:examples (6 months ago), connect:file (5 months ago),
> > > storage:api (5 months ago), and streams-scala (5 months ago).
> > >
> > > Thanks for the feedback all,
> > > Greg
> > >
> > > On Wed, Jun 7, 2023 at 7:55 AM Chris Egerton <chr...@aiven.io.invalid>
> > > wrote:
> > > >
> > > > Hi Greg,
> > > >
> > > > I can see the point about enabling partial runs as a temporary
> > > > measure to fight flakiness, and it does carry some merit. In that
> > > > case, though, we should have an idea of what the desired end state
> > > > is once we've stopped relying on any temporary measures. Do you
> > > > think we should aim to disable merges without a full suite of
> > > > passing CI runs (allowing for administrative override in an
> > > > emergency)? If so, what would the path be from our current state to
> > > > there? What can we do to ensure that we don't get stuck relying on
> > > > a once-temporary aid that becomes effectively permanent?
> > > >
> > > > With partial builds, we also need to be careful to make sure to
> > > > correctly handle cross-module dependencies. A tweak to broker or
> > > > client logic may only affect files in one module and pass all tests
> > > > for that module, but have far-reaching consequences for Streams,
> > > > Connect, and MM2. We probably want to build awareness of this
> > > > dependency graph into any partial CI logic we add, but if we do opt
> > > > for that, then this change would disproportionately benefit
> > > > downstream modules (Streams, Connect, MM2), and have little to no
> > > > benefit for upstream ones (clients and at least some core modules).
> > > >
> > > > With regards to faster iteration times--I agree that it would be
> > > > nice if our CI builds didn't take 2-3 hours, but people should
> > > > already be making sure that tests are running locally before they
> > > > push changes (or, if they really want, they can run tests locally
> > > > after pushing changes). And if rapid iteration is necessary, it's
> > > > always (or at least for the foreseeable future) going to be faster
> > > > to run whatever specific tests or build tasks you need to run
> > > > locally, instead of pushing to GitHub and waiting for Jenkins to
> > > > check for you.
> > > >
> > > > Finally, since there are a number of existing flaky tests on trunk,
> > > > what would the strategy be for handling those? Do we try to get to
> > > > a green state on a per-module basis (possibly with awareness of
> > > > downstream modules) as quickly as possible, and then selectively
> > > > enable partial builds once we feel confident that flakiness has been
> > > > addressed?
> > > >
> > > > Cheers,
> > > >
> > > > Chris
> > > >
> > > > On Wed, Jun 7, 2023 at 5:09 AM Gaurav Narula <ka...@gnarula.com>
> > > > wrote:
> > > >
> > > > > Hey Greg,
> > > > >
> > > > > Thanks for sharing this idea!
> > > > >
> > > > > The idea of building and testing a relevant subset of code
> > > > > certainly seems interesting.
> > > > >
> > > > > Perhaps this is a good fit for Bazel [1], where
> > > > > target-determinator [2] can be used to find a subset of targets
> > > > > that have changed between two commits.
> > > > >
> > > > > Even without [2], Bazel builds can benefit immensely from
> > > > > distributing builds to a set of remote nodes [3] with support for
> > > > > caching previously built targets [4].
> > > > >
> > > > > We've seen a few other ASF projects adopt Bazel as well:
> > > > >
> > > > > * https://github.com/apache/rocketmq
> > > > > * https://github.com/apache/brpc
> > > > > * https://github.com/apache/trafficserver
> > > > > * https://github.com/apache/ws-axiom
> > > > >
> > > > > I wonder how the Kafka community feels about experimenting with
> > > > > Bazel and exploring whether it helps us offer faster build times
> > > > > without compromising on the correctness of the targets that need
> > > > > to be built and tested.
> > > > >
> > > > > Thanks,
> > > > > Gaurav
> > > > >
> > > > > [1]: https://bazel.build
> > > > > [2]: https://github.com/bazel-contrib/target-determinator
> > > > > [3]: https://bazel.build/remote/rbe
> > > > > [4]: https://bazel.build/remote/caching
> > > > >
> > > > > On 2023/06/05 17:47:07 Greg Harris wrote:
> > > > > > Hey all,
> > > > > >
> > > > > > I've been working on test flakiness recently, and I've been
> > > > > > trying to come up with ways to tackle the issue top-down as well
> > > > > > as bottom-up, and I'm interested to hear your thoughts on an
> > > > > > idea.
> > > > > >
> > > > > > In addition to the current full-suite runs, can we in parallel
> > > > > > trigger a smaller test run which has only a relevant subset of
> > > > > > tests? For example, if someone is working on one sub-module, the
> > > > > > CI would only run tests in that module.
> > > > > >
> > > > > > I think this would be more likely to pass than the full suite,
> > > > > > since fewer tests would be failing probabilistically, and it
> > > > > > would improve the signal-to-noise ratio of the summary pass/fail
> > > > > > marker on GitHub. This should also be shorter to execute than
> > > > > > the full suite, allowing for a faster cycle-time than the
> > > > > > current full suite encourages.
> > > > > >
> > > > > > This would also strengthen the incentive for contributors
> > > > > > specializing in a module to de-flake tests, as they are rewarded
> > > > > > with a tangible improvement within their area of the project.
> > > > > > Currently, even the modules with the most reliable tests receive
> > > > > > consistent CI failures from other less reliable modules.
> > > > > >
> > > > > > I believe this is possible, even if there isn't an
> > > > > > off-the-shelf solution for it. We can learn of the changed files
> > > > > > via a git diff, map that to modules containing those files, and
> > > > > > then execute the tests just for those modules with gradle.
> > > > > > GitHub also permits showing multiple "checks" so that we can
> > > > > > emit both the full-suite and partial test results.
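> > > > > >
> > > > > > As a rough sketch of that flow (in Python purely to illustrate;
> > > > > > it assumes top-level directories map one-to-one to gradle
> > > > > > modules, which is a simplification, and the task paths in the
> > > > > > comment are only examples):
> > > > > >
> > > > > > import subprocess
> > > > > >
> > > > > > def changed_modules(base="origin/trunk"):
> > > > > >     # Files changed on this branch relative to the merge-base
> > > > > >     # with trunk.
> > > > > >     diff = subprocess.run(
> > > > > >         ["git", "diff", "--name-only", f"{base}...HEAD"],
> > > > > >         capture_output=True, text=True, check=True,
> > > > > >     ).stdout.splitlines()
> > > > > >     # Simplification: treat the first path component as the
> > > > > >     # module name.
> > > > > >     return sorted({p.split("/")[0] for p in diff if "/" in p})
> > > > > >
> > > > > > def run_partial_build(modules):
> > > > > >     # e.g. ./gradlew :clients:test :connect:test (real task
> > > > > >     # paths may differ for nested modules)
> > > > > >     tasks = [f":{m}:test" for m in modules]
> > > > > >     subprocess.run(["./gradlew"] + tasks, check=True)
> > > > > >
> > > > > > if __name__ == "__main__":
> > > > > >     run_partial_build(changed_modules())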
> > > > > >
> > > > > > Thanks,
> > > > > > Greg
> > > > > >
> > >
> >
> >
> > --
> > David Arthur
>


-- 
-David
