Gaurav,

The target-determinator project is exactly the sort of
"off-the-shelf" solution I expected would be out there. If the
project migrates to Bazel, I think that would make partial builds
much easier to implement.
Even if the partial builds feature is decided against, or is later
reverted, I think we should look into the other benefits of
migrating to Bazel to see whether the migration is worth it on its
own.
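
For concreteness, here is a rough sketch of the kind of change-based
target selection Bazel would enable. The query shape and the example
label are assumptions on my part; turning changed file paths into
Bazel labels is exactly the part target-determinator would handle
for us:

# Hypothetical sketch: ask Bazel which targets depend on a set of
# changed files, assuming those files have already been converted
# into labels (the part target-determinator takes care of).
import subprocess

def affected_targets(changed_file_labels):
    # rdeps(universe, x) = everything in the universe depending on x
    query = "rdeps(//..., set({}))".format(" ".join(changed_file_labels))
    result = subprocess.run(
        ["bazel", "query", query],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.splitlines()

# e.g. affected_targets(["//clients/src/main/java:SomeFile.java"])
# would list the targets to rebuild and retest for that change.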

Chris,

> Do you think we should aim to disable
> merges without a full suite of passing CI runs (allowing for administrative
> override in an emergency)? If so, what would the path be from our current
> state to there? What can we do to ensure that we don't get stuck relying on
> a once-temporary aid that becomes effectively permanent?

Yes, I think it would be nice to eventually require a green build to
merge, without the heavyweight merge-queue infrastructure that is
more common in high-volume monorepos. Here is roughly how I picture
the path from our current state:

1. We would decide on a sunset date for the partial build indicator,
at which point it would be disabled on trunk and all release
branches. I suspect that date could be set 1-3 releases in the
future.
2. We would enable partial builds. The merge button would remain
as-is, with both the partial build and full-suite indicators
visible.
3. Communication would go out explaining that the partial build
indicators are a temporary flakiness-reduction tool.
4. Committers would continue to rely on the full-suite indicators, and
watch the partial build to get a sense of the level of flakiness in
each module.
5. On new contributions with failing partial builds, committers can
ask contributors to investigate and follow up on flaky failures that
appeared in their PR, explaining that doing so helps make that
indicator green for future contributions.
6. After the sunset date passes, the partial build indicators will be disabled.
7. If the main trunk build is sufficiently reliable by some
criterion (e.g. 20 passes in a row, which could be checked
mechanically, as sketched below), we can discuss requiring a green
build for the GitHub merge button.
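
As a rough illustration of how the "20 passes in a row" criterion
could be checked, the Jenkins JSON API exposes recent build results.
The job URL below is a placeholder and would need to be checked
against our actual Jenkins instance:

# Rough sketch: check whether the last N trunk builds all passed.
# The Jenkins job URL is a placeholder, not the real one.
import json
import urllib.request

JENKINS_JOB = "https://ci-builds.apache.org/job/Kafka/job/kafka/job/trunk"

def last_builds_green(n=20):
    url = f"{JENKINS_JOB}/api/json?tree=builds[number,result]{{0,{n}}}"
    with urllib.request.urlopen(url) as resp:
        builds = json.load(resp)["builds"]
    return (len(builds) >= n
            and all(b["result"] == "SUCCESS" for b in builds))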

> We probably
> want to build awareness of this dependency graph into any partial CI logic
> we add, but if we do opt for that, then this change would
> disproportionately benefit downstream modules (Streams, Connect, MM2), and
> have little to no benefit for upstream ones (clients and at least some core
> modules).

I considered that, and I think it is more beneficial to provide
partial builds that treat every module equitably than partial builds
with perfect cross-module coverage. If you want to know about
cross-module failures, I think the full suite is still the
appropriate tool to detect those.
From another perspective, if this change only becomes useful to
`core` after all of the modules that depend on it have addressed
their flakiness, then it delays the reward that `core` contributors
would get for addressing their own flakiness.
And if anything, making the partial build cover fewer failure modes
and reminding people of that fact will keep the full-suite builds
relevant.

> but people should already be making
> sure that tests are running locally before they push changes

I agree, and I don't think that partial builds are in any danger of
replacing local testing for fast synchronous iteration. I also agree
that it is appropriate to give personal feedback to repeat
contributors to improve the quality of their PRs.
CI seems to be for enforcing a lowest common denominator of quality
on contributions, and for benefiting (potentially first-time)
contributors who do not have the familiarity, mindfulness, or
resources to pre-test each of their contributions.
It is in those situations that I think partial builds can be
beneficial: if there is something mechanical that all contributors
should be doing locally, why not have it done automatically on their
behalf?
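
To make that concrete, the mechanical step I have in mind is roughly
the following. This is deliberately naive (it treats each top-level
directory as a Gradle module, which isn't exactly how our build is
laid out), and the base ref is illustrative:

# Naive sketch: map changed files to top-level Gradle modules and
# run only those modules' tests.
import subprocess

def changed_modules(base_ref="origin/trunk"):
    diff = subprocess.run(
        ["git", "diff", "--name-only", f"{base_ref}...HEAD"],
        capture_output=True, text=True, check=True,
    ).stdout.splitlines()
    return sorted({path.split("/")[0] for path in diff if "/" in path})

def run_partial_build(modules):
    tasks = [f":{module}:test" for module in modules]
    subprocess.run(["./gradlew", *tasks], check=True)

if __name__ == "__main__":
    run_partial_build(changed_modules())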

> Finally, since there are a number of existing flaky tests on trunk, what
> would the strategy be for handling those? Do we try to get to a green state
> on a per-module basis (possibly with awareness of downstream modules) as
> quickly as possible, and then selectively enable partial builds once we
> feel confident that flakiness has been addressed?

For modules which receive regular PRs, the partial builds would
surface those flakes, creating an incentive to fix them. Once
per-module flakiness is under control, committers for those modules
can start to require green partial builds, pushing back on PRs with
obvious failures.
For modules which do not receive regular PRs, those flakes would
still appear in every full-suite build and wouldn't be surfaced by
the partial builds. Those would require either another incentive
structure, or volunteers from other modules to help fix them.
I'm not sure which modules receive the least PR traffic, but the
modules that currently look most stale are log4j-appender (last
touched about 1 year ago), streams:examples (6 months), connect:file
(5 months), storage:api (5 months), and streams-scala (5 months).

Thanks for the feedback, all,
Greg

On Wed, Jun 7, 2023 at 7:55 AM Chris Egerton <chr...@aiven.io.invalid> wrote:
>
> Hi Greg,
>
> I can see the point about enabling partial runs as a temporary measure to
> fight flakiness, and it does carry some merit. In that case, though, we
> should have an idea of what the desired end state is once we've stopped
> relying on any temporary measures. Do you think we should aim to disable
> merges without a full suite of passing CI runs (allowing for administrative
> override in an emergency)? If so, what would the path be from our current
> state to there? What can we do to ensure that we don't get stuck relying on
> a once-temporary aid that becomes effectively permanent?
>
> With partial builds, we also need to be careful to make sure to correctly
> handle cross-module dependencies. A tweak to broker or client logic may
> only affect files in one module and pass all tests for that module, but
> have far-reaching consequences for Streams, Connect, and MM2. We probably
> want to build awareness of this dependency graph into any partial CI logic
> we add, but if we do opt for that, then this change would
> disproportionately benefit downstream modules (Streams, Connect, MM2), and
> have little to no benefit for upstream ones (clients and at least some core
> modules).
>
> With regards to faster iteration times--I agree that it would be nice if
> our CI builds didn't take 2-3 hours, but people should already be making
> sure that tests are running locally before they push changes (or, if they
> really want, they can run tests locally after pushing changes). And if
> rapid iteration is necessary, it's always (or at least for the foreseeable
> future) going to be faster to run whatever specific tests or build tasks
> you need to run locally, instead of pushing to GitHub and waiting for
> Jenkins to check for you.
>
> Finally, since there are a number of existing flaky tests on trunk, what
> would the strategy be for handling those? Do we try to get to a green state
> on a per-module basis (possibly with awareness of downstream modules) as
> quickly as possible, and then selectively enable partial builds once we
> feel confident that flakiness has been addressed?
>
> Cheers,
>
> Chris
>
> On Wed, Jun 7, 2023 at 5:09 AM Gaurav Narula <ka...@gnarula.com> wrote:
>
> > Hey Greg,
> >
> > Thanks for sharing this idea!
> >
> > The idea of building and testing a relevant subset of code certainly seems
> > interesting.
> >
> > Perhaps this is a good fit for Bazel [1] where
> > target-determinator [2] can be used to find a subset of targets that
> > have changed between two commits.
> >
> > Even without [2], Bazel builds can benefit immensely from distributing
> > builds
> > to a set of remote nodes [3] with support for caching previously built
> > targets [4].
> >
> > We've seen a few other ASF projects adopt Bazel as well:
> >
> > * https://github.com/apache/rocketmq
> > * https://github.com/apache/brpc
> > * https://github.com/apache/trafficserver
> > * https://github.com/apache/ws-axiom
> >
> > I wonder how the Kafka community feels about experimenting with Bazel and
> > exploring if it helps us offer faster build times without compromising on
> > the
> > correctness of the targets that need to be built and tested?
> >
> > Thanks,
> > Gaurav
> >
> > [1]: https://bazel.build
> > [2]: https://github.com/bazel-contrib/target-determinator
> > [3]: https://bazel.build/remote/rbe
> > [4]: https://bazel.build/remote/caching
> >
> > On 2023/06/05 17:47:07 Greg Harris wrote:
> > > Hey all,
> > >
> > > I've been working on test flakiness recently, and I've been trying to
> > > come up with ways to tackle the issue top-down as well as bottom-up,
> > > and I'm interested to hear your thoughts on an idea.
> > >
> > > In addition to the current full-suite runs, can we in parallel trigger
> > > a smaller test run which has only a relevant subset of tests? For
> > > example, if someone is working on one sub-module, the CI would only
> > > run tests in that module.
> > >
> > > I think this would be more likely to pass than the full suite due to
> > > the fewer tests failing probabilistically, and would improve the
> > > signal-to-noise ratio of the summary pass/fail marker on GitHub. This
> > > should also be shorter to execute than the full suite, allowing for
> > > faster cycle-time than the current full suite encourages.
> > >
> > > This would also strengthen the incentive for contributors specializing
> > > in a module to de-flake tests, as they are rewarded with a tangible
> > > improvement within their area of the project. Currently, even the
> > > modules with the most reliable tests receive consistent CI failures
> > > from other less reliable modules.
> > >
> > > I believe this is possible, even if there isn't an off-the-shelf
> > > solution for it. We can learn of the changed files via a git diff, map
> > > that to modules containing those files, and then execute the tests
> > > just for those modules with gradle. GitHub also permits showing
> > > multiple "checks" so that we can emit both the full-suite and partial
> > > test results.
> > >
> > > Thanks,
> > > Greg
> > >
