Hi,

We have a dashboard already:
https://ge.apache.org/scans/tests?search.names=Git%20branch&search.relativeStartTime=P28D&search.rootProjectNames=kafka&search.timeZoneId=America%2FLos_Angeles&search.values=trunk&tests.sortField=FLAKY

On Tue, Nov 14, 2023 at 10:41 PM Николай Ижиков <nizhi...@apache.org> wrote:

> Hello guys.
>
> I want to tell you about one more approach to dealing with flaky tests. We adopted this approach in the Apache Ignite community, so it may be helpful for Kafka as well.
>
> TL;DR: The Apache Ignite community has a tool that provides test statistics and can tell whether a PR introduces new failures.
>
> Apache Ignite has many tests; the latest «Run All» contains around 75k. Most are integration-style tests, so the number of flaky ones is significant.
>
> We built a tool, the TeamCity Bot [1], that provides combined statistics on flaky tests [2].
>
> The tool can compare the «Run All» results of a PR against master. If everything is OK, one can comment on the JIRA ticket with a visa from the bot [3].
>
> A visa is proof of PR quality for Ignite committers, and with the bot's statistics [2] we can also identify the flakiest tests and prioritize fixes.
>
> The TC Bot is integrated with TeamCity only, for now. But if the Kafka community is interested, we can try to integrate it with Jenkins.
>
> [1] https://github.com/apache/ignite-teamcity-bot
> [2] https://tcbot2.sbt-ignite-dev.ru/current.html?branch=master&count=10
> [3] https://issues.apache.org/jira/browse/IGNITE-19950?focusedCommentId=17767394&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17767394
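For readers who want the gist of the visa check without reading the bot's source: the comparison boils down to flagging only those PR failures that are not already known to be flaky on the target branch. A minimal, hypothetical Java sketch of that idea (not the actual TC Bot code; the class name and threshold are invented for illustration):

    import java.util.HashSet;
    import java.util.Map;
    import java.util.Set;

    // Hypothetical sketch of the "visa" idea: flag only those failures in
    // a PR run that are not already known to be flaky on master.
    public class NewFailureDetector {

        // Fraction of recent master runs in which a test may fail and
        // still be considered "known flaky" rather than a new failure.
        private static final double FLAKY_THRESHOLD = 0.05;

        /**
         * @param prFailures        tests that failed in the PR run
         * @param masterFailureRate test name -> failure rate over recent master runs
         * @return tests whose failure cannot be explained by known flakiness
         */
        public static Set<String> newFailures(Set<String> prFailures,
                                              Map<String, Double> masterFailureRate) {
            Set<String> result = new HashSet<>();
            for (String test : prFailures) {
                double rate = masterFailureRate.getOrDefault(test, 0.0);
                if (rate < FLAKY_THRESHOLD) {
                    result.add(test); // not known flaky on master -> blame the PR
                }
            }
            return result;
        }
    }

A PR would earn its visa only if this set comes back empty.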
> On Nov 15, 2023, at 09:18, Ismael Juma <m...@ismaeljuma.com> wrote:
>
> > To use the pain analogy, people seem to have really good painkillers and hence they somehow don't feel the pain already. ;)
> >
> > The reality is that important and high-quality tests will get fixed. Poor-quality tests (low signal-to-noise ratio) might not get fixed, and that's ok.
> >
> > I'm not opposed to marking the tests as release blockers as a starting point, but I'm saying it's fine if people triage them and decide they are not blockers. In fact, that has already happened in the past.
> >
> > Ismael
> >
> > On Tue, Nov 14, 2023 at 10:02 PM Matthias J. Sax <mj...@apache.org> wrote:
> >
> >> I agree on the test gap argument. However, my worry is that if we don't "force the pain", it won't get fixed at all. -- I also know that we have tried to find a working approach for many years...
> >>
> >> My take is that if we disable a test and file a non-blocking JIRA, it's basically the same as deleting the test altogether and never talking about it again. -- I believe this is not what we aim for; we aim for good test coverage and a way to get these tests fixed.
> >>
> >> Thus, IMHO we need some forcing function: either keep the tests and feel the pain on every PR, or disable the test and file a blocker JIRA so the pain surfaces on a release, forcing us to do something about it.
> >>
> >> If there is no forcing function, it basically means we are willing to accept test gaps forever.
> >>
> >> -Matthias
> >>
> >> On 11/14/23 9:09 PM, Ismael Juma wrote:
> >>
> >>> Matthias,
> >>>
> >>> Flaky tests are worse than useless. I know engineers find it hard to disable them because of the supposed test gap (I find it hard too), but the truth is that the test gap is already there! No one blocks merging PRs on flaky tests, but they do get used to ignoring build failures.
> >>>
> >>> The current approach has been attempted for nearly a decade and it has never worked. I think we should try something different.
> >>>
> >>> When it comes to marking flaky tests as release blockers, I don't think this should be done as a general rule. We should instead assess on a case-by-case basis, the same way we do it for bugs.
> >>>
> >>> Ismael
> >>>
> >>> On Tue, Nov 14, 2023 at 5:02 PM Matthias J. Sax <mj...@apache.org> wrote:
> >>>
> >>>> Thanks for starting this discussion, David! I totally agree with "no"!
> >>>>
> >>>> I think there is no excuse whatsoever for merging PRs with compilation errors (except an honest mistake for conflicting PRs that got merged interleaved). -- Every committer must(!) check the Jenkins status before merging to avoid such an issue.
> >>>>
> >>>> Similarly for actual permanently broken tests: if there is no green build, and the same test failed across multiple Jenkins runs, a committer should detect this and cannot merge a PR.
> >>>>
> >>>> Given the current state of the CI pipeline, it seems possible to get green runs, and thus I support the policy (that we actually always had) to only merge if there is at least one green build. If committers got sloppy about this, we need to call it out and put a hold on this practice.
> >>>>
> >>>> (The only exception from the above policy would be a very unstable status for which getting a green build is not possible at all, due to too many flaky tests -- for this case, I would accept merging even if there is no green build, but committers need to manually ensure that every test passed in at least one test run. -- We had this in the past, but I don't think we are in such a bad situation right now.)
> >>>>
> >>>> About disabling tests: I was never a fan of this, because in my experience those tests are not fixed any time soon, especially because we do not consider such tickets release blockers. To me, seeing tests fail on a PR build is actually a good forcing function for people to feel the pain, and thus get motivated to make time to fix the tests.
> >>>>
> >>>> I have to admit that I was a little bit sloppy about paying attention to flaky tests recently, and I highly appreciate this effort. Also, thanks to everyone who actually filed a ticket! IMHO, we should file a ticket for every flaky test, and also keep adding comments each time we see a test fail, to be able to track the frequency at which a test fails, so we can fix the most pressing ones first.
> >>>>
> >>>> To me, the best forcing function to get tests stabilized is to file tickets and consider them release blockers. Disabling tests does not really help much IMHO to tackle the problem (we can of course still disable them to get noise out of the system, but that would only introduce testing gaps for the time being, and it also does not help to figure out how often a test fails, so it's not a solution to the problem IMHO).
> >>>>
> >>>> -Matthias
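A concrete middle ground between the two positions above is to quarantine rather than fully disable. A minimal JUnit 5 sketch (Kafka's tests use JUnit 5; the tag name, ticket number, and test below are placeholders, not an established Kafka convention):

    import org.junit.jupiter.api.Tag;
    import org.junit.jupiter.api.Test;

    import static org.junit.jupiter.api.Assertions.assertTrue;

    class SomeIntegrationTest {

        // Tagging instead of deleting or disabling keeps the coverage:
        // CI can exclude the "flaky" tag from PR builds but still run
        // tagged tests in a separate, non-blocking job, so
        // failure-frequency data keeps accumulating. A blunter
        // alternative is @Disabled("KAFKA-XXXXX: flaky, see JIRA").
        @Tag("flaky")
        @Test
        void shouldCommitOffsetsOnRebalance() {
            assertTrue(true); // placeholder body for illustration
        }
    }

The build could then exclude the tag from PR runs (e.g. Gradle's useJUnitPlatform { excludeTags("flaky") }) while a separate non-blocking job keeps executing the quarantined tests.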
> >>>>
> >>>> On 11/13/23 11:40 PM, Sagar wrote:
> >>>>
> >>>>> Hi Divij,
> >>>>>
> >>>>> I think this proposal overall makes sense. My only nit-level suggestion is that we also consider the newbie++ label [1] for flaky tests, if we are considering adding newbie as a label. I think some of the flaky tests need familiarity with the codebase or the test setup, so it might be difficult for a first-time contributor; newbie++ IMO covers that aspect.
> >>>>>
> >>>>> [1] https://issues.apache.org/jira/browse/KAFKA-15406?jql=project%20%3D%20KAFKA%20AND%20labels%20%3D%20%22newbie%2B%2B%22
> >>>>>
> >>>>> Let me know what you think.
> >>>>>
> >>>>> Thanks!
> >>>>> Sagar.
> >>>>>
> >>>>> On Mon, Nov 13, 2023 at 9:11 PM Divij Vaidya <divijvaidy...@gmail.com> wrote:
> >>>>>
> >>>>>>> Please, do it.
> >>>>>>> We can use a specific label to effectively filter those tickets.
> >>>>>>
> >>>>>> We already have a label and a way to discover flaky tests. They are tagged with the label "flaky-test" [1]. There is also a label "newbie" [2] meant for folks who are new to the Apache Kafka code base.
> >>>>>> My suggestion is to send a broader email to the community (since many will miss details in this thread) and call for committers to volunteer as "shepherds" for these tickets. I can send one out once we have some consensus on next steps in this thread.
> >>>>>>
> >>>>>> [1] https://issues.apache.org/jira/browse/KAFKA-13421?jql=project%20%3D%20KAFKA%20AND%20status%20in%20(Open%2C%20%22In%20Progress%22%2C%20Reopened%2C%20%22Patch%20Available%22)%20AND%20resolution%20%3D%20Unresolved%20AND%20labels%20%3D%20flaky-test%20ORDER%20BY%20priority%20DESC%2C%20updated%20DESC
> >>>>>>
> >>>>>> [2] https://kafka.apache.org/contributing -> Finding a project to work on
> >>>>>>
> >>>>>> Divij Vaidya
> >>>>>>
> >>>>>> On Mon, Nov 13, 2023 at 4:24 PM Николай Ижиков <nizhi...@apache.org> wrote:
> >>>>>>
> >>>>>>>> To kickstart this effort, we can publish a list of such tickets in the community and assign one or more committers the role of a "shepherd" for each ticket.
> >>>>>>>
> >>>>>>> Please, do it.
> >>>>>>> We can use a specific label to effectively filter those tickets.
> >>>>>>>
> >>>>>>> On Nov 13, 2023, at 15:16, Divij Vaidya <divijvaidy...@gmail.com> wrote:
> >>>>>>>
> >>>>>>>> Thanks for bringing this up, David.
> >>>>>>>>
> >>>>>>>> My primary concern revolves around the possibility that the currently disabled tests may remain inactive indefinitely. We currently have unresolved JIRA tickets for flaky tests that have been pending for an extended period. I am inclined to support the idea of disabling these tests temporarily and merging changes only when the build is successful, provided there is a clear plan for re-enabling them in the future.
> >>>>>>>>
> >>>>>>>> To address this issue, I propose the following measures:
> >>>>>>>>
> >>>>>>>> 1\ Foster a supportive environment for new contributors within the community, encouraging them to take on tickets associated with flaky tests. This initiative would require individuals familiar with the relevant code to offer guidance to those undertaking these tasks. Committers should prioritize reviewing and addressing these tickets within their available bandwidth. To kickstart this effort, we can publish a list of such tickets in the community and assign one or more committers the role of a "shepherd" for each ticket.
> >>>>>>>>
> >>>>>>>> 2\ Implement a policy to block minor version releases until the Release Manager (RM) is satisfied that the disabled tests do not result in gaps in our testing coverage. The RM may rely on Subject Matter Experts (SMEs) in the specific code areas to provide assurance before giving the green light for a release.
> >>>>>>>>
> >>>>>>>> 3\ Set a community-wide goal for 2024 to achieve a stable Continuous Integration (CI) system. This goal should encompass projects such as refining our test suite to eliminate flakiness and addressing infrastructure issues if necessary. By publishing this goal, we create a shared vision for the community in 2024, fostering alignment on our objectives. This alignment will aid in prioritizing tasks for community members and guide reviewers in allocating their bandwidth effectively.
> >>>>>>>>
> >>>>>>>> --
> >>>>>>>> Divij Vaidya
> >>>>>>>>
> >>>>>>>> On Sun, Nov 12, 2023 at 2:58 AM Justine Olshan <jols...@confluent.io.invalid> wrote:
> >>>>>>>>
> >>>>>>>>> I will say that I have also seen tests that are flaky intermittently: a test may be OK for some time, and then suddenly the CI is overloaded and we see issues. I have also seen the CI struggling with running out of space recently, so I wonder if we can also try to improve things on that front.
> >>>>>>>>>
> >>>>>>>>> FWIW, I noticed, filed, or commented on several flaky-test JIRAs last week. I'm happy to try to get to green builds, but everyone needs to be on board.
> >>>>>>>>>
> >>>>>>>>> https://issues.apache.org/jira/browse/KAFKA-15529
> >>>>>>>>> https://issues.apache.org/jira/browse/KAFKA-14806
> >>>>>>>>> https://issues.apache.org/jira/browse/KAFKA-14249
> >>>>>>>>> https://issues.apache.org/jira/browse/KAFKA-15798
> >>>>>>>>> https://issues.apache.org/jira/browse/KAFKA-15797
> >>>>>>>>> https://issues.apache.org/jira/browse/KAFKA-15690
> >>>>>>>>> https://issues.apache.org/jira/browse/KAFKA-15699
> >>>>>>>>> https://issues.apache.org/jira/browse/KAFKA-15772
> >>>>>>>>> https://issues.apache.org/jira/browse/KAFKA-15759
> >>>>>>>>> https://issues.apache.org/jira/browse/KAFKA-15760
> >>>>>>>>> https://issues.apache.org/jira/browse/KAFKA-15700
> >>>>>>>>>
> >>>>>>>>> I've also seen that the KRaft transactions tests often fail flakily because the producer ID is not allocated and the test times out. I can file a JIRA for that too.
> >>>>>>>>>
> >>>>>>>>> Hopefully this is a place we can start from.
> >>>>>>>>>
> >>>>>>>>> Justine
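Timeouts like the producer-ID one above are often a symptom of a test giving asynchronous work a single fixed delay. A common de-flaking pattern is to poll until a deadline instead; here is a generic, self-contained sketch (Kafka's own test utilities provide a similar waitForCondition helper, but this is not its actual implementation):

    import java.util.function.BooleanSupplier;

    // Generic polling wait, similar in spirit to the waitForCondition
    // helper in Kafka's test utilities (sketch only).
    public final class WaitUtil {

        public static void waitForCondition(BooleanSupplier condition,
                                            long maxWaitMs,
                                            String description) throws InterruptedException {
            long deadline = System.currentTimeMillis() + maxWaitMs;
            while (!condition.getAsBoolean()) {
                if (System.currentTimeMillis() > deadline) {
                    throw new AssertionError("Timed out waiting for: " + description);
                }
                Thread.sleep(100); // poll instead of a single fixed sleep
            }
        }
    }

A call like waitForCondition(() -> producerIdAllocated(), 15_000, "producer id allocation") still fails loudly on real bugs but tolerates slow CI machines; producerIdAllocated() is a made-up stand-in for whatever condition the test actually needs.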
> >>>>>>>>>
> >>>>>>>>> On Sat, Nov 11, 2023 at 11:35 AM Ismael Juma <m...@ismaeljuma.com> wrote:
> >>>>>>>>>
> >>>>>>>>>> On Sat, Nov 11, 2023 at 10:32 AM John Roesler <vvcep...@apache.org> wrote:
> >>>>>>>>>>
> >>>>>>>>>>> In other words, I'm biased to think that new flakiness indicates non-deterministic bugs more often than it indicates a bad test.
> >>>>>>>>>>
> >>>>>>>>>> My experience is exactly the opposite. As someone who has tracked many of the flaky fixes, the vast majority of the time they are an issue with the test.
> >>>>>>>>>>
> >>>>>>>>>> Ismael