Hi all,

There has been great progress in fixing the flaky tests. The builds seem more stable now that more fixes have been merged to master. This work has an impact. Thank you for the contributions.

Our work is not over. There's a lot more to fix. Please continue contributing to make Pulsar CI better.

Here's the list of open issues:
https://github.com/apache/pulsar/issues?q=is%3Aissue+is%3Aopen+Flaky-test+sort%3Aupdated-desc

As usual, please comment on the issue to assign it to yourself.

You can join Pulsar Slack's #testing channel to share tips & tricks around fixing the flaky tests or to ask questions.

Keep up the good work!

BR, Lari

On Wed, Feb 3, 2021 at 9:07 PM Lari Hotari <lari.hot...@sagire.fi> wrote:

> Hi all,
>
> Here's the next batch of flaky test issues:
>
> #9459 Flaky-test: PulsarFunctionsTest.testDebeziumPostgreSqlSource
> <https://github.com/apache/pulsar/issues/9459>
>
> #9458 Flaky-test: ReplicatorTest.testReplication
> <https://github.com/apache/pulsar/issues/9458>
>
> #9457 Flaky-test:ReplicatorTest.testReplicatorOnPartitionedTopic
> <https://github.com/apache/pulsar/issues/9457>
>
> #9456 Flaky-test: TestProxy
> <https://github.com/apache/pulsar/issues/9456>
>
> #9455 Flaky-test: PulsarFunctionsTest.testCustomSerdeFunction
> <https://github.com/apache/pulsar/issues/9455>
>
> #9454 Flaky-test: CLITest.testCreateSubscriptionCommand
> <https://github.com/apache/pulsar/issues/9454>
>
> #9453 Flaky-test: PulsarFunctionsProcessTest.testAvroSchemaFunction
> <https://github.com/apache/pulsar/issues/9453>
>
> #9452 Flaky-test: org.apache.pulsar.tests.integration.SmokeTest.setup
> <https://github.com/apache/pulsar/issues/9452>
>
> #9451 Flaky-test: SimpleProducerConsumerTest.testConcurrentConsumerReceiveWhileReconnect
> <https://github.com/apache/pulsar/issues/9451>
>
> #9450 Flaky-test: org.apache.pulsar.broker.service.ReplicatorTest.testResetCursorNotFail
> <https://github.com/apache/pulsar/issues/9450>
>
> The ReplicatorTest (
> https://github.com/apache/pulsar/blob/master/pulsar-broker/src/test/java/org/apache/pulsar/broker/service/ReplicatorTest.java
> ) is contributing to a lot of failures. Here's a complete list of example
> failures:
> https://gist.github.com/lhotari/ff58a94ef42bc6ed41165ed10c7d1cfd
> Fixing it would have a really great impact. I filed 2 issues about
> ReplicatorTest.
>
> Keep up the good work in fixing flaky tests. There are again a lot of
> great contributions. Thank you!
>
> BR, Lari
>
> On Wed, Feb 3, 2021 at 6:35 AM Lari Hotari <lari.hot...@sagire.fi> wrote:
>
>> Hi all,
>>
>> There are links to recent failures of a particular flaky test in the
>> recently reported flaky test GitHub issues (
>> https://github.com/apache/pulsar/issues?q=flaky+sort%3Aupdated-desc+is%3Aissue+is%3Aopen
>> ).
>>
>> Example from https://github.com/apache/pulsar/issues/9437 :
>> example failure 2021-02-01T09:41:10.0922161Z
>> <https://github.com/apache/pulsar/runs/1804628430?check_suite_focus=true#step:6:12322>
>> example failure 2021-01-29T07:51:57.9989389Z
>> <https://github.com/apache/pulsar/runs/1789838309?check_suite_focus=true#step:6:18491>
>> example failure 2021-01-28T02:42:14.3316285Z
>> <https://github.com/apache/pulsar/runs/1781184081?check_suite_focus=true#step:6:18415>
>> example failure 2021-01-27T21:44:09.7619772Z
>> <https://github.com/apache/pulsar/runs/1778470820?check_suite_focus=true#step:6:6213>
>>
>> These links point to the exact line in the build log.
>> For example:
>> https://github.com/apache/pulsar/runs/1804628430?check_suite_focus=true#step:6:12322
>>
>> *When opening this link, it should navigate directly to line number
>> 12322 in step 6 of the workflow run log.*
>>
>> However, there's a bug in the GitHub UI: this doesn't work if the link
>> is clicked from a page within github.com . The parameters and hash of
>> the URL get lost and the focus doesn't go to the line where the error
>> happened.
>>
>> *The workaround is to open the "example failure" links in a new
>> tab/window by CTRL-click (Windows, Linux) or CMD-click (macOS).*
>>
>> I hope this helps investigate the flaky test failures more efficiently!
>>
>> BR,
>>
>> Lari
>>
>> On Tue, Feb 2, 2021 at 7:35 PM Lari Hotari <lari.hot...@sagire.fi> wrote:
>>
>>> The good progress continues!
>>> One way to see the issue & PR activity where "flaky" is mentioned:
>>> https://github.com/apache/pulsar/issues?q=flaky+sort%3Aupdated-desc
>>> Thank you to the contributors and PR reviewers!
>>>
>>> Here's the next flaky test for someone to fix:
>>> https://github.com/apache/pulsar/issues/6646 (reported a long time ago,
>>> I added some examples of recent failures)
>>> It's about PulsarFunctionsTest. This test class contributes to a lot of
>>> failures. I have uploaded a list of failures to
>>> https://gist.github.com/lhotari/9bae3e16674c297a6bbc2b4831515a74 .
>>> I haven't validated that all failures are from flaky test runs. It's
>>> possible that some are from a build which broke the test.
>>>
>>> 1) Who could pick up fixing the multiple issues in PulsarFunctionsTest,
>>> https://github.com/apache/pulsar/issues/6646 ? You can comment directly
>>> on issue #6646 and start working on it if you wish. It would be a really
>>> important fix to have.
>>>
>>> 2) Another one: https://github.com/apache/pulsar/issues/9431
>>>
>>> 3) The 3rd one might be a quick fix, it's an NPE in cleanup:
>>> https://github.com/apache/pulsar/issues/9432
>>>
>>> I'm looking for the sprinting to continue. It seems that the issues get
>>> fixed sooner than I can report more of them. :)
>>>
>>> BR, Lari
>>>
>>> On Mon, Feb 1, 2021 at 8:18 PM Lari Hotari <lari.hot...@sagire.fi> wrote:
>>>
>>>> Dear Pulsar community members,
>>>>
>>>> Thanks for picking up the work so quickly! I noticed that at least
>>>> Renkai and Michael already pushed pull requests to fix the flaky tests
>>>> that were mentioned in the previous email. Some of the PRs have already
>>>> been merged.
>>>>
>>>> Here are 3 more flaky tests with links to a lot of example failures:
>>>> https://github.com/apache/pulsar/issues/9407
>>>> https://github.com/apache/pulsar/issues/9408
>>>> https://github.com/apache/pulsar/issues/9409
>>>>
>>>> I'll report more flaky tests tomorrow. Today I was working on some
>>>> tooling to mine the logs and gather some statistics.
>>>>
>>>> I parsed the logs of the last few days and these are the test methods
>>>> that fail the most:
>>>>
>>>> 273 org.apache.pulsar.tests.integration.utils.DockerUtils$2.onComplete
>>>> 102 org.apache.pulsar.compaction.CompactionTest.cleanup
>>>>  81 org.apache.pulsar.admin.cli.PulsarAdminToolTest.topics
>>>>  51 org.apache.pulsar.broker.loadbalance.LoadBalancerTest.testLeaderElection
>>>>  45 org.apache.pulsar.io.PulsarFunctionE2ETest.shutdown
>>>>  40 org.apache.pulsar.broker.service.ConsumedLedgersTrimTest.testConsumedLedgersTrimNoSubscriptions
>>>>  36 org.apache.pulsar.websocket.proxy.ProxyPublishConsumeTest.cleanup
>>>>  30 org.apache.pulsar.functions.worker.PulsarFunctionLocalRunTest.shutdown
>>>>  30 org.apache.pulsar.tests.integration.functions.PulsarFunctionsTest.testJavaExclamationTopicPatternFunction
>>>>  29 org.apache.pulsar.tests.integration.functions.PulsarFunctionsTest.testJavaExclamationFunction
>>>>  27 org.apache.pulsar.client.api.v1.V1_ProducerConsumerTest.testConcurrentConsumerReceiveWhileReconnect
>>>>  26 org.apache.pulsar.client.admin.internal.http.AsyncHttpConnector.lambda$retryOperation$3
>>>>  22 org.apache.pulsar.broker.service.ReplicatorTest.testResetCursorNotFail
>>>>  22 org.apache.pulsar.tests.integration.functions.PulsarFunctionsTest.testJavaLoggingFunction
>>>>  21 org.apache.pulsar.tests.integration.SmokeTest.setup
>>>>  20 org.apache.pulsar.client.impl.MessageIdTest.testChecksumReconnection
>>>>  20 org.apache.pulsar.client.impl.MessageIdTest.testChecksumVersionComptability
>>>>  19 org.apache.pulsar.tests.integration.functions.PulsarFunctionsTest.testPythonFunctionLocalRun
>>>>  19 org.apache.pulsar.tests.integration.functions.PulsarFunctionsTest.testAutoSchemaFunction
>>>>  14 org.apache.pulsar.client.api.SimpleProducerConsumerTest.testConcurrentConsumerReceiveWhileReconnect
>>>>  14 org.apache.pulsar.broker.service.MessagePublishBufferThrottleTest.testBlockByPublishRateLimiting
>>>>  14 org.apache.pulsar.tests.integration.functions.PulsarFunctionsTest.testSlidingCountWindowTest
>>>>  13 org.apache.pulsar.tests.integration.backwardscompatibility.ClientTest2_2.testResetCursorCompatibility
>>>>  12 org.apache.pulsar.tests.integration.functions.PulsarFunctionsTest.testPythonPublishFunction
>>>>  12 org.apache.pulsar.tests.integration.functions.PulsarFunctionsTest.testPythonExclamationTopicPatternFunction
>>>>  12 org.apache.pulsar.tests.integration.functions.PulsarFunctionsTest.testPythonExclamationFunctionWithExtraDeps
>>>>  12 org.apache.pulsar.tests.integration.functions.PulsarFunctionsTest.testPythonExclamationZipFunction
>>>>  12 org.apache.pulsar.tests.integration.functions.PulsarFunctionsTest.testPythonFunctionNegAck
>>>>  12 org.apache.pulsar.tests.integration.functions.PulsarFunctionsTest.testPythonExclamationFunction
>>>>  12 org.apache.pulsar.compaction.CompactorTest.cleanup
>>>>  12 org.apache.pulsar.broker.service.BrokerServiceAutoSubscriptionCreationTest.cleanupTest
>>>>  12 org.apache.pulsar.websocket.proxy.ProxyAuthenticationTest.cleanup
>>>>  12 org.apache.pulsar.websocket.proxy.v1.V1_ProxyAuthenticationTest.cleanup
>>>>  12 org.apache.pulsar.client.impl.BatchMessageIndexAckTest.testBatchMessageIndexAckForSharedSubscription
>>>>  11 org.apache.pulsar.tests.integration.functions.PulsarFunctionsTest.testJavaPublishFunction
>>>>  11 org.apache.pulsar.broker.loadbalance.AntiAffinityNamespaceGroupTest.testBrokerSelectionForAntiAffinityGroup
>>>>
>>>> I'll report more flaky tests after I have checked that my tooling is
>>>> producing correct results.
>>>>
>>>> To contribute, please pick a flaky test to fix from the reported ones:
>>>>
>>>> https://github.com/apache/pulsar/issues?q=flaky+sort%3Aupdated-desc+is%3Aopen
>>>>
>>>> We can all join the #testing channel on Pulsar Slack to share detailed
>>>> tips and tricks while working on fixing flaky tests.
>>>>
>>>> See you,
>>>>
>>>> BR, Lari
>>>>
>>>> On Fri, Jan 29, 2021 at 8:26 PM Lari Hotari <lari.hot...@sagire.fi> wrote:
>>>>
>>>>> Dear Pulsar community members,
>>>>>
>>>>> In order to improve our CI, we will have to fix the flaky tests. In
>>>>> some cases it might be necessary to replace an existing test with a
>>>>> redesigned test.
>>>>>
>>>>> The draft PIP "Changes to flaky test handling" document
>>>>> <https://docs.google.com/document/d/10lmn4pW1IsT_8D1ZE0vMjASX0HhjdGdjB794iyScwns/edit?usp=sharing>
>>>>> lists the top 10 flaky tests. A lot of them have already been
>>>>> addressed by pull requests in the past week or so.
>>>>>
>>>>> This is the list of recent PRs that fix flaky tests from the top 10
>>>>> flaky tests list:
>>>>> https://github.com/apache/pulsar/pull/9286
>>>>> https://github.com/apache/pulsar/pull/9243
>>>>> https://github.com/apache/pulsar/pull/9258
>>>>> https://github.com/apache/pulsar/pull/9356
>>>>>
>>>>> These are the GH issues for the remaining ones in the top 10 flaky
>>>>> tests list:
>>>>> https://github.com/apache/pulsar/issues/6368
>>>>> https://github.com/apache/pulsar/issues/9369
>>>>> https://github.com/apache/pulsar/issues/9368
>>>>>
>>>>> If you would like to help fix flaky tests, you can pick one of the
>>>>> open issues above. Just add a comment on the issue when you start
>>>>> working on it so that we can coordinate activities.
>>>>>
>>>>> It is also helpful to report a flaky test when you encounter one. I've
>>>>> been using this type of template for reporting a flaky test:
>>>>> https://gist.github.com/lhotari/a5c67359b362b4f3d8729330d65a2298 .
>>>>> The issues #9368 and #9369 have been reported using this template.
>>>>> Search for the test name before reporting so that we don't end up with
>>>>> duplicates.
>>>>>
>>>>> The issues #6368, #9369 and #9368 are the next 3 important issues to
>>>>> fix. I'm planning to create a more extensive list of the flaky
>>>>> failures so that we can target the most flaky ones when we continue
>>>>> fixing the flaky tests. I have some scripts in development to assist
>>>>> in mining the Pulsar GitHub Actions workflow run logs.
>>>>>
>>>>> This is a search to find flaky issues in Pulsar GH issues:
>>>>>
>>>>> https://github.com/apache/pulsar/issues?q=flaky+sort%3Aupdated-desc+is%3Aopen
>>>>>
>>>>> Looking forward to the contributions for fixing flaky tests,
>>>>>
>>>>> BR,
>>>>>
>>>>> Lari
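
The Feb 1 message mentions tooling that mines the build logs and tallies failures per test method, but the thread doesn't show that tooling. For anyone who wants to produce a similar "count, test method" list from downloaded logs, here is a minimal sketch. It is not the script used for the numbers above. It assumes failures show up as Maven Surefire style lines of the form `[ERROR] method(fully.qualified.Class)  Time elapsed: ... <<< FAILURE!`, which may not match the actual Pulsar build output. The names `FAILURE_LINE`, `tally_failures` and `format_report` and the `org.example` classes in the sample are made up for illustration.

```python
import re
from collections import Counter

# ASSUMPTION: failure lines look like this (adjust the pattern to the real logs):
#   <timestamp> [ERROR] testFoo(org.example.SomeTest)  Time elapsed: 1.2 s  <<< FAILURE!
FAILURE_LINE = re.compile(
    r"\[ERROR\]\s+(?P<method>[\w$]+)\((?P<cls>[\w.$]+)\).*<<<\s+(?:FAILURE|ERROR)!"
)


def tally_failures(lines):
    """Count failures per fully qualified test method (Class.method)."""
    counts = Counter()
    for line in lines:
        match = FAILURE_LINE.search(line)
        if match:
            counts[f"{match['cls']}.{match['method']}"] += 1
    return counts


def format_report(counts, limit=10):
    """Render the most frequent failures as 'count name' lines, highest first."""
    return "\n".join(f"{n:>5} {name}" for name, n in counts.most_common(limit))


if __name__ == "__main__":
    # Hypothetical sample lines, not taken from a real Pulsar build log.
    sample = [
        "2021-02-01T09:41:10Z [ERROR] testReplication(org.example.ReplicatorTest)  Time elapsed: 1.2 s  <<< FAILURE!",
        "2021-02-01T09:41:11Z [INFO] Tests run: 5, Failures: 1",
        "2021-02-01T09:42:00Z [ERROR] testReplication(org.example.ReplicatorTest)  Time elapsed: 0.9 s  <<< ERROR!",
        "2021-02-01T09:43:00Z [ERROR] cleanup(org.example.CompactionTest)  Time elapsed: 0.1 s  <<< FAILURE!",
    ]
    print(format_report(tally_failures(sample)))
```

As noted in the Feb 2 message, a tally like this can't tell a flaky failure from a build that genuinely broke a test, so the top entries still need a manual check before being reported as flaky.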