Hi,

We have a dashboard already:

https://ge.apache.org/scans/tests?search.names=Git%20branch&search.relativeStartTime=P28D&search.rootProjectNames=kafka&search.timeZoneId=America%2FLos_Angeles&search.values=trunk&tests.sortField=FLAKY

On Tue, Nov 14, 2023 at 10:41 PM Николай Ижиков <nizhi...@apache.org> wrote:

> Hello guys.
>
> I want to tell you about one more approach to dealing with flaky tests.
> We adopted this approach in the Apache Ignite community, so maybe it can
> be helpful for Kafka as well.
>
> TL;DR: the Apache Ignite community has a tool that provides statistics
> on tests and can tell whether a PR introduces new failures.
>
> Apache Ignite has many tests.
> The latest «Run All» contains around 75k.
> Most of the tests are integration-style, so the number of flaky ones is
> significant.
>
> We built a tool, the TeamCity Bot [1], which provides combined
> statistics on flaky tests [2].
>
> This tool can compare the Run All results of a PR against master.
> If everything is OK, the bot comments on the JIRA ticket with a "visa" [3].
>
> A visa is proof of PR quality for Ignite committers.
> We can also single out the most flaky tests and prioritize fixes using
> the bot's statistics [2].
>
> The TC Bot is integrated with TeamCity only, for now.
> But if the Kafka community is interested, we can try to integrate it with Jenkins.
>
> [1] https://github.com/apache/ignite-teamcity-bot
> [2] https://tcbot2.sbt-ignite-dev.ru/current.html?branch=master&count=10
> [3]
> https://issues.apache.org/jira/browse/IGNITE-19950?focusedCommentId=17767394&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17767394
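>
> A rough sketch of the comparison idea (hypothetical names, not the
> actual TC Bot code): the failures of the PR run are diffed against the
> master baseline and the known-flaky set, and only genuinely new
> failures block the visa.
>
>     import java.util.HashSet;
>     import java.util.Set;
>
>     public class NewFailureCheck {
>         // Returns tests that failed in the PR run but neither fail on
>         // master nor are already tracked as flaky.
>         static Set<String> newFailures(Set<String> prFailures,
>                                        Set<String> masterFailures,
>                                        Set<String> knownFlaky) {
>             Set<String> result = new HashSet<>(prFailures);
>             result.removeAll(masterFailures);
>             result.removeAll(knownFlaky);
>             return result;
>         }
>
>         public static void main(String[] args) {
>             Set<String> pr = Set.of("TopicTest.testCreate", "LogTest.testRoll");
>             Set<String> master = Set.of("LogTest.testRoll");
>             System.out.println(newFailures(pr, master, Set.of()));
>             // prints [TopicTest.testCreate]
>         }
>     }
>
> A PR whose set of new failures is empty would get the visa even if some
> already-known flaky tests failed in its run.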
>
>
>
> > On 15 Nov 2023, at 09:18, Ismael Juma <m...@ismaeljuma.com> wrote:
> >
> > To use the pain analogy, people seem to have really good painkillers and
> > hence they somehow don't feel the pain already. ;)
> >
> > The reality is that important, high-quality tests will get fixed. Poor
> > quality tests (low signal-to-noise ratio) might not get fixed, and
> > that's OK.
> >
> > I'm not opposed to marking the tests as release blockers as a starting
> > point, but I'm saying it's fine if people triage them and decide they are
> > not blockers. In fact, that has already happened in the past.
> >
> > Ismael
> >
> > On Tue, Nov 14, 2023 at 10:02 PM Matthias J. Sax <mj...@apache.org> wrote:
> >
> >> I agree with the test gap argument. However, my worry is that if we
> >> don't "force the pain", it won't get fixed at all. -- I also know that
> >> we have been trying to find a working approach for many years...
> >>
> >> My take is that if we disable a test and file a non-blocking Jira, it's
> >> basically the same as deleting the test altogether and never talking
> >> about it again. -- I believe this is not what we aim for; we aim for
> >> good test coverage and a way to get these tests fixed.
> >>
> >> Thus IMHO we need some forcing function: either keep the tests and feel
> >> the pain on every PR, or disable the test and file a blocker JIRA so
> >> the pain surfaces at release time, forcing us to do something about it.
> >>
> >> If there is no forcing function, it basically means we are willing to
> >> accept test gaps forever.
> >>
> >>
> >> -Matthias
> >>
> >> On 11/14/23 9:09 PM, Ismael Juma wrote:
> >>> Matthias,
> >>>
> >>> Flaky tests are worse than useless. I know engineers find it hard to
> >>> disable them because of the supposed test gap (I find it hard too),
> >>> but the truth is that the test gap is already there! No one blocks
> >>> merging PRs on flaky tests, but they do get used to ignoring build
> >>> failures.
> >>>
> >>> The current approach has been attempted for nearly a decade and it has
> >>> never worked. I think we should try something different.
> >>>
> >>> When it comes to marking flaky tests as release blockers, I don't
> >>> think this should be done as a general rule. We should instead assess
> >>> on a case-by-case basis, the same way we do for bugs.
> >>>
> >>> Ismael
> >>>
> >>> On Tue, Nov 14, 2023 at 5:02 PM Matthias J. Sax <mj...@apache.org> wrote:
> >>>
> >>>> Thanks for starting this discussion, David! I totally agree with the "no"!
> >>>>
> >>>> I think there is no excuse whatsoever for merging PRs with compilation
> >>>> errors (except for an honest mistake with conflicting PRs that got
> >>>> merged interleaved). -- Every committer must(!) check the Jenkins
> >>>> status before merging to avoid such an issue.
> >>>>
> >>>> The same goes for actual, permanently broken tests. If there is no
> >>>> green build, and the same test failed across multiple Jenkins runs, a
> >>>> committer should detect this and must not merge the PR.
> >>>>
> >>>> Given the current state of the CI pipeline, it seems possible to get
> >>>> green runs, and thus I support the policy (that we actually always
> >>>> had) to only merge if there is at least one green build. If committers
> >>>> got sloppy about this, we need to call it out and put a hold on this
> >>>> practice.
> >>>>
> >>>> (The only exception to the above policy would be a very unstable
> >>>> status for which getting a green build is not possible at all, due to
> >>>> too many flaky tests -- in this case, I would accept merging even if
> >>>> there is no green build, but committers need to manually ensure that
> >>>> every test passed in at least one test run. -- We had this in the
> >>>> past, but I don't think we are in such a bad situation right now.)
> >>>>
> >>>> About disabling tests: I was never a fan of this, because in my
> >>>> experience those tests are not fixed any time soon, especially
> >>>> because we do not consider such tickets as release blockers. To me,
> >>>> seeing tests fail on PR builds is actually a good forcing function
> >>>> for people to feel the pain, and thus get motivated to make time to
> >>>> fix the tests.
> >>>>
> >>>> I have to admit that I was a little bit sloppy about paying attention
> >>>> to flaky tests recently, and I highly appreciate this effort. Also
> >>>> thanks to everyone who actually filed a ticket! IMHO, we should file
> >>>> a ticket for every flaky test, and also keep adding comments each
> >>>> time we see a test fail, so we can track the frequency at which a
> >>>> test fails and fix the most pressing ones first.
> >>>>
> >>>> To me, the best forcing function to get tests stabilized is to file
> >>>> tickets and consider them release blockers. Disabling tests does not
> >>>> really help much IMHO to tackle the problem (we can of course still
> >>>> disable them to get noise out of the system, but that would only
> >>>> introduce testing gaps for the time being and also does not help us
> >>>> figure out how often a test fails, so it's not a solution to the
> >>>> problem IMHO).
> >>>>
> >>>>
> >>>> -Matthias
> >>>>
> >>>> On 11/13/23 11:40 PM, Sagar wrote:
> >>>>> Hi Divij,
> >>>>>
> >>>>> I think this proposal overall makes sense. My only nit, sort of a
> >>>>> suggestion, is that we also consider a label called newbie++ [1] for
> >>>>> flaky tests if we are considering adding newbie as a label. I think
> >>>>> some of the flaky tests need familiarity with the codebase or the
> >>>>> test setup, so for a first-time contributor it might be difficult.
> >>>>> newbie++ IMO covers that aspect.
> >>>>>
> >>>>> [1] https://issues.apache.org/jira/browse/KAFKA-15406?jql=project%20%3D%20KAFKA%20AND%20labels%20%3D%20%22newbie%2B%2B%22
> >>>>>
> >>>>> Let me know what you think.
> >>>>>
> >>>>> Thanks!
> >>>>> Sagar.
> >>>>>
> >>>>> On Mon, Nov 13, 2023 at 9:11 PM Divij Vaidya <divijvaidy...@gmail.com>
> >>>>> wrote:
> >>>>>
> >>>>>>>   Please, do it.
> >>>>>>> We can use specific labels to effectively filter those tickets.
> >>>>>>
> >>>>>> We already have a label and a way to discover flaky tests. They are
> >>>>>> tagged with the label "flaky-test" [1]. There is also a label
> >>>>>> "newbie" [2] meant for folks who are new to the Apache Kafka code
> >>>>>> base.
> >>>>>> My suggestion is to send a broader email to the community (since
> >>>>>> many will miss the details in this thread) and issue a call to
> >>>>>> action for committers to volunteer as "shepherds" for these tickets.
> >>>>>> I can send one out once we have some consensus wrt next steps in
> >>>>>> this thread.
> >>>>>>
> >>>>>>
> >>>>>> [1] https://issues.apache.org/jira/browse/KAFKA-13421?jql=project%20%3D%20KAFKA%20AND%20status%20in%20(Open%2C%20%22In%20Progress%22%2C%20Reopened%2C%20%22Patch%20Available%22)%20AND%20resolution%20%3D%20Unresolved%20AND%20labels%20%3D%20flaky-test%20ORDER%20BY%20priority%20DESC%2C%20updated%20DESC
> >>>>>>
> >>>>>>
> >>>>>> [2] https://kafka.apache.org/contributing -> Finding a project to
> >>>>>> work on
> >>>>>>
> >>>>>>
> >>>>>> Divij Vaidya
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>> On Mon, Nov 13, 2023 at 4:24 PM Николай Ижиков <nizhi...@apache.org>
> >>>>>> wrote:
> >>>>>>
> >>>>>>>
> >>>>>>>> To kickstart this effort, we can publish a list of such tickets
> >>>>>>>> in the community and assign one or more committers the role of a
> >>>>>>>> "shepherd" for each ticket.
> >>>>>>>
> >>>>>>> Please, do it.
> >>>>>>> We can use a specific label to effectively filter those tickets.
> >>>>>>>
> >>>>>>>> On 13 Nov 2023, at 15:16, Divij Vaidya <divijvaidy...@gmail.com> wrote:
> >>>>>>>>
> >>>>>>>> Thanks for bringing this up, David.
> >>>>>>>>
> >>>>>>>> My primary concern revolves around the possibility that the
> >>>>>>>> currently disabled tests may remain inactive indefinitely. We
> >>>>>>>> currently have unresolved JIRA tickets for flaky tests that have
> >>>>>>>> been pending for an extended period. I am inclined to support the
> >>>>>>>> idea of disabling these tests temporarily and merging changes only
> >>>>>>>> when the build is successful, provided there is a clear plan for
> >>>>>>>> re-enabling them in the future.
> >>>>>>>>
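> >>>>>>>> A minimal sketch of what "disabling temporarily with a plan for
> >>>>>>>> re-enabling" can look like (assuming JUnit 5; the ticket number and
> >>>>>>>> test name below are hypothetical placeholders):
> >>>>>>>>
> >>>>>>>>     import org.junit.jupiter.api.Disabled;
> >>>>>>>>     import org.junit.jupiter.api.Test;
> >>>>>>>>
> >>>>>>>>     class SomeFlakyTest {
> >>>>>>>>         // KAFKA-XXXXX is a placeholder for the tracking ticket; the
> >>>>>>>>         // annotation text keeps the test gap visible and gives
> >>>>>>>>         // re-enablement an owner.
> >>>>>>>>         @Disabled("KAFKA-XXXXX: flaky, re-enable once the ticket is resolved")
> >>>>>>>>         @Test
> >>>>>>>>         void flakyScenario() {
> >>>>>>>>             // original test body unchanged
> >>>>>>>>         }
> >>>>>>>>     }
> >>>>>>>>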
> >>>>>>>> To address this issue, I propose the following measures:
> >>>>>>>>
> >>>>>>>> 1\ Foster a supportive environment for new contributors within the
> >>>>>>>> community, encouraging them to take on tickets associated with
> >>>>>>>> flaky tests. This initiative would require individuals familiar
> >>>>>>>> with the relevant code to offer guidance to those undertaking these
> >>>>>>>> tasks. Committers should prioritize reviewing and addressing these
> >>>>>>>> tickets within their available bandwidth. To kickstart this effort,
> >>>>>>>> we can publish a list of such tickets in the community and assign
> >>>>>>>> one or more committers the role of a "shepherd" for each ticket.
> >>>>>>>>
> >>>>>>>> 2\ Implement a policy to block minor version releases until the
> >>>>>>>> Release Manager (RM) is satisfied that the disabled tests do not
> >>>>>>>> result in gaps in our testing coverage. The RM may rely on Subject
> >>>>>>>> Matter Experts (SMEs) in the specific code areas to provide
> >>>>>>>> assurance before giving the green light for a release.
> >>>>>>>>
> >>>>>>>> 3\ Set a community-wide goal for 2024 to achieve a stable
> >>>>>>>> Continuous Integration (CI) system. This goal should encompass
> >>>>>>>> projects such as refining our test suite to eliminate flakiness and
> >>>>>>>> addressing infrastructure issues if necessary. By publishing this
> >>>>>>>> goal, we create a shared vision for the community in 2024,
> >>>>>>>> fostering alignment on our objectives. This alignment will aid in
> >>>>>>>> prioritizing tasks for community members and guide reviewers in
> >>>>>>>> allocating their bandwidth effectively.
> >>>>>>>>
> >>>>>>>> --
> >>>>>>>> Divij Vaidya
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> On Sun, Nov 12, 2023 at 2:58 AM Justine Olshan
> >>>>>>>> <jols...@confluent.io.invalid> wrote:
> >>>>>>>>
> >>>>>>>>> I will say that I have also seen tests that seem to be flaky only
> >>>>>>>>> intermittently. They may be OK for some time, and then suddenly
> >>>>>>>>> the CI is overloaded and we see issues.
> >>>>>>>>> I have also seen the CI struggling with running out of space
> >>>>>>>>> recently, so I wonder if we can also try to improve things on
> >>>>>>>>> that front.
> >>>>>>>>>
> >>>>>>>>> FWIW, I noticed, filed, or commented on several flaky test JIRAs
> >>>>>>>>> last week.
> >>>>>>>>> I'm happy to try to get to green builds, but everyone needs to be
> >>>>>>>>> on board.
> >>>>>>>>>
> >>>>>>>>> https://issues.apache.org/jira/browse/KAFKA-15529
> >>>>>>>>> https://issues.apache.org/jira/browse/KAFKA-14806
> >>>>>>>>> https://issues.apache.org/jira/browse/KAFKA-14249
> >>>>>>>>> https://issues.apache.org/jira/browse/KAFKA-15798
> >>>>>>>>> https://issues.apache.org/jira/browse/KAFKA-15797
> >>>>>>>>> https://issues.apache.org/jira/browse/KAFKA-15690
> >>>>>>>>> https://issues.apache.org/jira/browse/KAFKA-15699
> >>>>>>>>> https://issues.apache.org/jira/browse/KAFKA-15772
> >>>>>>>>> https://issues.apache.org/jira/browse/KAFKA-15759
> >>>>>>>>> https://issues.apache.org/jira/browse/KAFKA-15760
> >>>>>>>>> https://issues.apache.org/jira/browse/KAFKA-15700
> >>>>>>>>>
> >>>>>>>>> I've also seen that the KRaft transactions tests often flakily
> >>>>>>>>> find that the producer ID is not allocated and time out.
> >>>>>>>>> I can file a JIRA for that too.
> >>>>>>>>>
> >>>>>>>>> Hopefully this is a place we can start from.
> >>>>>>>>>
> >>>>>>>>> Justine
> >>>>>>>>>
> >>>>>>>>> On Sat, Nov 11, 2023 at 11:35 AM Ismael Juma <m...@ismaeljuma.com>
> >>>>>>>>> wrote:
> >>>>>>>>>
> >>>>>>>>>> On Sat, Nov 11, 2023 at 10:32 AM John Roesler <vvcep...@apache.org>
> >>>>>>>>>> wrote:
> >>>>>>>>>>
> >>>>>>>>>>> In other words, I’m biased to think that new flakiness indicates
> >>>>>>>>>>> non-deterministic bugs more often than it indicates a bad test.
> >>>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> My experience is exactly the opposite. As someone who has tracked
> >>>>>>>>>> many of the flaky fixes, the vast majority of the time they are an
> >>>>>>>>>> issue with the test.
> >>>>>>>>>>
> >>>>>>>>>> Ismael
> >>>>>>>>>>
> >>>>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>
> >>>>>
> >>>>
> >>>
> >>
>
>
