Thanks for bringing this up, David.

My primary concern revolves around the possibility that the currently
disabled tests may remain inactive indefinitely. We currently have
unresolved JIRA tickets for flaky tests that have been pending for an
extended period. I am inclined to support the idea of disabling these tests
temporarily and merging changes only when the build is successful, provided
there is a clear plan for re-enabling them in the future.

To address this issue, I propose the following measures:

1. Foster a supportive environment for new contributors within the
community, encouraging them to take on tickets associated with flaky tests.
This initiative would require individuals familiar with the relevant code
to offer guidance to those undertaking these tasks. Committers should
prioritize reviewing and addressing these tickets within their available
bandwidth. To kickstart this effort, we can publish a list of such tickets
in the community and assign one or more committers the role of a "shepherd"
for each ticket.

2. Implement a policy to block minor version releases until the Release
Manager (RM) is satisfied that the disabled tests do not result in gaps in
our testing coverage. The RM may rely on Subject Matter Experts (SMEs) in
the specific code areas to provide assurance before giving the green light
for a release.

3. Set a community-wide goal for 2024 to achieve a stable Continuous
Integration (CI) system. This goal should encompass projects such as
refining our test suite to eliminate flakiness and addressing
infrastructure issues if necessary. By publishing this goal, we create a
shared vision for the community in 2024, fostering alignment on our
objectives. This alignment will aid in prioritizing tasks for community
members and guide reviewers in allocating their bandwidth effectively.

--
Divij Vaidya



On Sun, Nov 12, 2023 at 2:58 AM Justine Olshan <jols...@confluent.io.invalid>
wrote:

> I will say that I have also seen tests that seem to be more flaky
> intermittently. It may be ok for some time and suddenly the CI is
> overloaded and we see issues.
> I have also seen the CI struggling with running out of space recently, so I
> wonder if we can also try to improve things on that front.
>
> FWIW, I noticed, filed, or commented on several flaky test JIRAs last week.
> I'm happy to try to get to green builds, but everyone needs to be on board.
>
> https://issues.apache.org/jira/browse/KAFKA-15529
> https://issues.apache.org/jira/browse/KAFKA-14806
> https://issues.apache.org/jira/browse/KAFKA-14249
> https://issues.apache.org/jira/browse/KAFKA-15798
> https://issues.apache.org/jira/browse/KAFKA-15797
> https://issues.apache.org/jira/browse/KAFKA-15690
> https://issues.apache.org/jira/browse/KAFKA-15699
> https://issues.apache.org/jira/browse/KAFKA-15772
> https://issues.apache.org/jira/browse/KAFKA-15759
> https://issues.apache.org/jira/browse/KAFKA-15760
> https://issues.apache.org/jira/browse/KAFKA-15700
>
> I've also seen that KRaft transaction tests are often flaky: the
> producer ID is not allocated and the test times out.
> I can file a JIRA for that too.
>
> Hopefully this is a place we can start from.
>
> Justine
>
> On Sat, Nov 11, 2023 at 11:35 AM Ismael Juma <m...@ismaeljuma.com> wrote:
>
> > On Sat, Nov 11, 2023 at 10:32 AM John Roesler <vvcep...@apache.org> wrote:
> >
> > > In other words, I’m biased to think that new flakiness indicates
> > > non-deterministic bugs more often than it indicates a bad test.
> > >
> >
> > My experience is exactly the opposite. As someone who has tracked many of
> > the flaky fixes, the vast majority of the time they are an issue with the
> > test.
> >
> > Ismael
> >
>
