I like:
*Include ignored or quarantined tests in the release notes*
*Run flaky tests only in postcommit* (related? *Separate flaky tests into quarantine job*)
*Require link to Jira to rerun a test*
I am concerned about: *Add Gradle or Jenkins plugin to retry flaky tests* - because it is a convenient place for real bugs to hide.
I do not know much about: *Consider Gradle Enterprise* https://testautonation.com/analyse-test-results-deflake-flaky-tests/

Thank you for putting this list together! I believe that even if we commit to doing only some of these, we would have a much healthier project. If we can build consensus on implementing them, I will be happy to work on some of them.

On Fri, Jul 24, 2020 at 1:54 PM Kenneth Knowles <k...@apache.org> wrote:

> Adding https://testautonation.com/analyse-test-results-deflake-flaky-tests/ to the list, which seems to be a more powerful test history tool.
>
> On Fri, Jul 24, 2020 at 1:51 PM Kenneth Knowles <k...@apache.org> wrote:
>
>> Had some off-list chats to brainstorm and I wanted to bring ideas back to the dev@ list for consideration. A lot can be combined. I would really like to have a section in the release notes. I like the idea of banishing flakes from pre-commit (since you can't easily tell if it was a real failure caused by the PR) and auto-retrying in post-commit (so we can gather data on exactly what is flaking without a lot of manual investigation).
>>
>> *Include ignored or quarantined tests in the release notes*
>> Pro:
>> - Users are aware of what is not being tested and so may be silently broken
>> - It forces discussion of ignored tests to be part of our community processes
>> Con:
>> - It may look bad if the list is large (this is actually also a Pro, because if it looks bad, it is bad)
>>
>> *Run flaky tests only in postcommit*
>> Pro:
>> - isolates the bad signal so pre-commit is not affected
>> - saves pointless re-runs in pre-commit
>> - keeps a signal in post-commit that we can watch, instead of losing it completely when we disable a test
>> - maybe keeps the flaky tests in a job related to what they are testing
>> Con:
>> - we have to really watch post-commit or flakes can turn into failures
>>
>> *Separate flaky tests into quarantine job*
>> Pro:
>> - gain signal for healthy tests, as with disabling or running in post-commit
>> - also saves pointless re-runs
>> Con:
>> - may collect bad tests that we never look at, so it becomes the same as disabling the tests
>> - lots of unrelated tests grouped into one signal instead of focused on the health of a particular component
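For concreteness, the postcommit/quarantine split described above can be as simple as a per-test tag plus per-job test selection. A minimal sketch for the Python side, assuming pytest and an illustrative "quarantine" marker (not an existing Beam convention); the Java side would need an equivalent JUnit category or test filter:

# conftest.py -- register the (hypothetical) quarantine marker so pytest knows about it.
def pytest_configure(config):
    config.addinivalue_line(
        "markers",
        "quarantine: known-flaky test, excluded from precommit and run in a separate job")


# test_example.py -- tag the flaky test and point at its tracking issue.
import pytest


@pytest.mark.quarantine  # tracked in BEAM-XXXX (placeholder issue id)
def test_sometimes_flaky():
    assert 1 + 1 == 2  # stand-in for the real, intermittently failing assertion

Precommit would then run "pytest -m 'not quarantine'" while a dedicated quarantine or postcommit job runs "pytest -m quarantine", so the healthy suite keeps a clean signal and the flaky set still produces data.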
>> *Add Gradle or Jenkins plugin to retry flaky tests*
>> https://blog.gradle.org/gradle-flaky-test-retry-plugin
>> https://plugins.jenkins.io/flaky-test-handler/
>> Pro:
>> - easier than Jiras with a human pasting links; works with moving flakes to post-commit
>> - get a somewhat automated view of flakiness, whether in pre-commit or post-commit
>> - don't get stopped by flakiness
>> Con:
>> - maybe too easy to ignore flakes; we should add all flakes (not just disabled or quarantined) to the release notes
>> - sometimes flakes are actual bugs (like concurrency) so treating this as OK is not desirable
>> - without Jiras, no automated release notes
>> - Jenkins: retry will only work at the job level because it needs Maven to retry only the failed tests (I think)
>> - Jenkins: some of our jobs may have duplicate test names (but that might already be fixed)
>>
>> *Consider Gradle Enterprise*
>> Pro:
>> - get Gradle scan granularity of flake data (and other stuff)
>> - also gives module-level health, which we do not have today
>> Con:
>> - cost and administrative burden unknown
>> - we probably have to do some small work to make our jobs compatible with their history tracking
>>
>> *Require link to Jira to rerun a test*
>> Instead of saying "Run Java PreCommit" you have to link to the bug relating to the failure.
>> Pro:
>> - forces investigation
>> - helps others find out about issues
>> Con:
>> - adds a lot of manual work, or requires automation (which will probably be ad hoc and fragile)
>>
>> Kenn
>>
>> On Mon, Jul 20, 2020 at 11:59 AM Brian Hulette <bhule...@google.com> wrote:
>>
>>> > I think we are missing a way for checking that we are making progress on P1 issues. For example, P0 issues block releases and this obviously results in fixing/triaging/addressing P0 issues at least every 6 weeks. We do not have a similar process for flaky tests. I do not know what would be a good policy. One suggestion is to ping (email/slack) assignees of issues. I recently missed a flaky issue that was assigned to me. A ping like that would have reminded me. And if an assignee cannot help/does not have the time, we can try to find a new assignee.
>>>
>>> Yeah, I think this is something we should address. With the new Jira automation, at least assignees should get an email notification after 30 days because of a Jira comment like [1], but that's too long to let a test continue to flake. Could Beam Jira Bot ping every N days for P1s that aren't making progress?
>>>
>>> That wouldn't help us with P1s that have no assignee, or are assigned to overloaded people. It seems we'd need some kind of dashboard or report to capture those.
>>>
>>> [1] https://issues.apache.org/jira/browse/BEAM-8101?focusedCommentId=17121918&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17121918
>>>
>>> On Fri, Jul 17, 2020 at 1:09 PM Ahmet Altay <al...@google.com> wrote:
>>>
>>>> Another idea: could we replace our "Retest X" phrases with "Retest X (Reason)" phrases? With this change a PR author will have to look at the failed test logs. They could catch new flakiness introduced by their PR, file a JIRA for flakiness that was not noted before, or ping an existing JIRA issue/raise its severity. On the downside, this will require PR authors to do more.
>>>>
>>>> On Fri, Jul 17, 2020 at 6:46 AM Tyson Hamilton <tyso...@google.com> wrote:
>>>>
>>>>> Adding retries can be beneficial in two ways: unblocking a PR, and collecting metrics about the flakes.
>>>>
>>>> Makes sense. I think we will still need to have a plan to remove retries, similar to re-enabling disabled tests.
>>>>
>>>>> If we also had a flaky test leaderboard that showed which tests are the most flaky, then we could take action on them. Encouraging someone from the community to fix the flaky test is another issue.
>>>>>
>>>>> The test status matrix on the GitHub landing page could show the flake level to communicate to users which modules are losing a trustable test signal. Maybe this shows up as a flake % or a code coverage % that decreases due to disabled flaky tests.
>>>>
>>>> +1 to a dashboard that will show a "leaderboard" of flaky tests.
>>>>
>>>>> I didn't look for plugins, just dreaming up some options.
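One low-tech version of that leaderboard would fold the JUnit XML reports Jenkins already archives into per-test failure counts; a rough sketch, assuming the report files have been copied locally (the glob pattern is an assumed layout, not a Beam convention):

import glob
import xml.etree.ElementTree as ET
from collections import Counter


def flake_leaderboard(report_glob="reports/**/TEST-*.xml", top_n=20):
    """Count failed/errored testcases across many JUnit XML reports and rank them."""
    failures = Counter()
    for path in glob.glob(report_glob, recursive=True):
        for case in ET.parse(path).getroot().iter("testcase"):
            name = "{}.{}".format(case.get("classname"), case.get("name"))
            if case.find("failure") is not None or case.find("error") is not None:
                failures[name] += 1
    for name, count in failures.most_common(top_n):
        print("{:4d}  {}".format(count, name))

Tests that fail often across unrelated PRs and runs float to the top, which is a reasonable first approximation of "most flaky".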
>>>>> On Thu, Jul 16, 2020, 5:58 PM Luke Cwik <lc...@google.com> wrote:
>>>>>
>>>>>> What do other Apache projects do to address this issue?
>>>>>>
>>>>>> On Thu, Jul 16, 2020 at 5:51 PM Ahmet Altay <al...@google.com> wrote:
>>>>>>
>>>>>>> I agree with the comments in this thread.
>>>>>>> - If we are not re-enabling disabled tests, or do not have a plan to re-enable them, disabling tests only provides us temporary relief until users eventually find the issues instead of the disabled tests.
>>>>>>> - I feel similarly about retries. It is reasonable to add retries for reasons we understand. Adding retries just to avoid flakes is similar to disabling tests. They might hide real issues.
>>>>>>>
>>>>>>> I think we are missing a way for checking that we are making progress on P1 issues. For example, P0 issues block releases and this obviously results in fixing/triaging/addressing P0 issues at least every 6 weeks. We do not have a similar process for flaky tests. I do not know what would be a good policy. One suggestion is to ping (email/slack) assignees of issues. I recently missed a flaky issue that was assigned to me. A ping like that would have reminded me. And if an assignee cannot help/does not have the time, we can try to find a new assignee.
>>>>>>>
>>>>>>> Ahmet
>>>>>>>
>>>>>>> On Thu, Jul 16, 2020 at 11:52 AM Valentyn Tymofieiev <valen...@google.com> wrote:
>>>>>>>
>>>>>>>> I think the original discussion[1] on introducing tenacity might answer that question.
>>>>>>>>
>>>>>>>> [1] https://lists.apache.org/thread.html/16060fd7f4d408857a5e4a2598cc96ebac0f744b65bf4699001350af%40%3Cdev.beam.apache.org%3E
>>>>>>>>
>>>>>>>> On Thu, Jul 16, 2020 at 10:48 AM Rui Wang <ruw...@google.com> wrote:
>>>>>>>>
>>>>>>>>> Is there an observation that enabling tenacity improves the development experience on the Python SDK? E.g. less wait time to get a PR passing and merged? Or it might be a matter of choosing the right number of retries to align with the "flakiness" of a test?
>>>>>>>>>
>>>>>>>>> -Rui
>>>>>>>>>
>>>>>>>>> On Thu, Jul 16, 2020 at 10:38 AM Valentyn Tymofieiev <valen...@google.com> wrote:
>>>>>>>>>
>>>>>>>>>> We used tenacity[1] to retry some unit tests for which we understood the nature of the flakiness.
>>>>>>>>>>
>>>>>>>>>> [1] https://github.com/apache/beam/blob/3b9aae2bcaeb48ab43a77368ae496edc73634c91/sdks/python/apache_beam/runners/portability/fn_api_runner/fn_runner_test.py#L1156
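For reference, that tenacity usage amounts to a retry decorator on the test method; a minimal self-contained sketch (the retry parameters and the simulated flake are illustrative, not the exact code in fn_runner_test.py):

import random
import unittest

from tenacity import retry, stop_after_attempt


class RetryExampleTest(unittest.TestCase):

    # Retry up to 3 attempts for a flake whose cause we understand;
    # reraise=True surfaces the original failure if every attempt fails.
    @retry(reraise=True, stop=stop_after_attempt(3))
    def test_with_understood_flakiness(self):
        # Stand-in for a call that intermittently fails for a known reason.
        self.assertLess(random.random(), 0.9)


if __name__ == "__main__":
    unittest.main()

The important part is that the retry is scoped to one test with a known failure mode, and is easy to grep for and remove once the underlying issue is fixed.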
>>>>>>>>>> On Thu, Jul 16, 2020 at 10:25 AM Kenneth Knowles <k...@apache.org> wrote:
>>>>>>>>>>
>>>>>>>>>>> Didn't we use something like that flaky retry plugin for Python tests at some point? Adding retries may be preferable to disabling the test. We need a process to remove the retries ASAP though. As Luke says, that is not so easy to make happen. Having a way to make P1 bugs more visible in an ongoing way may help.
>>>>>>>>>>>
>>>>>>>>>>> Kenn
>>>>>>>>>>>
>>>>>>>>>>> On Thu, Jul 16, 2020 at 8:57 AM Luke Cwik <lc...@google.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> I don't think I have seen tests that were previously disabled become re-enabled.
>>>>>>>>>>>>
>>>>>>>>>>>> It seems as though we have about ~60 disabled tests in Java and ~15 in Python. Half of the Java ones seem to be in ZetaSQL/SQL due to missing features, so they are unrelated to flakiness.
>>>>>>>>>>>>
>>>>>>>>>>>> On Thu, Jul 16, 2020 at 8:49 AM Gleb Kanterov <g...@spotify.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> There is something called test-retry-gradle-plugin [1]. It retries tests if they fail, and it has different modes for handling flaky tests. Did we ever try or consider using it?
>>>>>>>>>>>>>
>>>>>>>>>>>>> [1]: https://github.com/gradle/test-retry-gradle-plugin
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Thu, Jul 16, 2020 at 1:15 PM Gleb Kanterov <g...@spotify.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> I agree with what Ahmet is saying. I can share my perspective: recently I had to retrigger a build 6 times due to flaky tests, and each retrigger took one hour of waiting time.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I've seen examples of automatic tracking of flaky tests, where a test is considered flaky if it both fails and succeeds for the same git SHA. I'm not sure if there is anything we can enable to get this automatically.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> /Gleb
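The detection rule Gleb describes is straightforward to express once results are exported somewhere queryable: group runs by (git SHA, test) and flag any pair that has both a pass and a fail. A small sketch over an assumed export of (sha, test, passed) records:

from collections import defaultdict


def find_flaky_tests(results):
    """results: iterable of (git_sha, test_name, passed) tuples exported from CI runs."""
    outcomes = defaultdict(set)
    for sha, test, passed in results:
        outcomes[(sha, test)].add(passed)
    # Flaky: the same code (same SHA) produced both a pass and a fail for the test.
    return sorted({test for (_, test), seen in outcomes.items() if seen == {True, False}})


# Example: test_b both failed and passed at SHA "abc123", so it is reported as flaky.
print(find_flaky_tests([
    ("abc123", "test_a", True),
    ("abc123", "test_b", False),
    ("abc123", "test_b", True),
]))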
>>>>>>>>>>>>>> On Thu, Jul 16, 2020 at 2:33 AM Ahmet Altay <al...@google.com> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I think it would be reasonable to disable/sickbay any flaky test that is actively blocking people. The collective cost of flaky tests for such a large group of contributors is very significant.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Most of these issues are unassigned. IMO, it makes sense to assign these issues to the most relevant person (who added the test/who generally maintains those components). Those people can either fix and re-enable the tests, or remove them if they no longer provide valuable signals.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Ahmet
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Wed, Jul 15, 2020 at 4:55 PM Kenneth Knowles <k...@apache.org> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> The situation is much worse than that IMO. My experience of the last few days is that a large portion of time went to *just connecting failing runs with the corresponding Jira tickets or filing new ones*.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Summarized on PRs:
>>>>>>>>>>>>>>>> - https://github.com/apache/beam/pull/12272#issuecomment-659050891
>>>>>>>>>>>>>>>> - https://github.com/apache/beam/pull/12273#issuecomment-659070317
>>>>>>>>>>>>>>>> - https://github.com/apache/beam/pull/12225#issuecomment-656973073
>>>>>>>>>>>>>>>> - https://github.com/apache/beam/pull/12225#issuecomment-657743373
>>>>>>>>>>>>>>>> - https://github.com/apache/beam/pull/12224#issuecomment-657744481
>>>>>>>>>>>>>>>> - https://github.com/apache/beam/pull/12216#issuecomment-657735289
>>>>>>>>>>>>>>>> - https://github.com/apache/beam/pull/12216#issuecomment-657780781
>>>>>>>>>>>>>>>> - https://github.com/apache/beam/pull/12216#issuecomment-657799415
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> The tickets:
>>>>>>>>>>>>>>>> - https://issues.apache.org/jira/browse/BEAM-10460 SparkPortableExecutionTest
>>>>>>>>>>>>>>>> - https://issues.apache.org/jira/browse/BEAM-10471 CassandraIOTest > testEstimatedSizeBytes
>>>>>>>>>>>>>>>> - https://issues.apache.org/jira/browse/BEAM-10504 ElasticSearchIOTest > testWriteFullAddressing and testWriteWithIndexFn
>>>>>>>>>>>>>>>> - https://issues.apache.org/jira/browse/BEAM-10470 JdbcDriverTest
>>>>>>>>>>>>>>>> - https://issues.apache.org/jira/browse/BEAM-8025 CassandraIOTest > @BeforeClass (classmethod)
>>>>>>>>>>>>>>>> - https://issues.apache.org/jira/browse/BEAM-8454 FnHarnessTest
>>>>>>>>>>>>>>>> - https://issues.apache.org/jira/browse/BEAM-10506 SplunkEventWriterTest
>>>>>>>>>>>>>>>> - https://issues.apache.org/jira/browse/BEAM-10472 direct runner ParDoLifecycleTest
>>>>>>>>>>>>>>>> - https://issues.apache.org/jira/browse/BEAM-9187 DefaultJobBundleFactoryTest
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Here are our P1 test flake bugs:
>>>>>>>>>>>>>>>> https://issues.apache.org/jira/issues/?jql=project%20%3D%20BEAM%20AND%20status%20in%20(Open%2C%20%22In%20Progress%22)%20AND%20resolution%20%3D%20Unresolved%20AND%20labels%20%3D%20flake%20ORDER%20BY%20priority%20DESC%2C%20updated%20DESC
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> It seems quite a few of them are actively hindering people right now.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Kenn
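Brian's "ping every N days" idea and the P1 flake query above could be combined into a small scheduled report rather than ad hoc clicking; a sketch against the standard Jira REST search API, assuming anonymous read access to issues.apache.org (auth, error handling, and the actual notification mechanism are left out):

import requests

JQL = ('project = BEAM AND status in (Open, "In Progress") '
       'AND resolution = Unresolved AND labels = flake '
       'ORDER BY priority DESC, updated DESC')


def stale_flake_report(base_url="https://issues.apache.org/jira"):
    """Print open flake issues with their assignee (or UNASSIGNED) for a periodic nag."""
    resp = requests.get(
        "{}/rest/api/2/search".format(base_url),
        params={"jql": JQL, "fields": "summary,assignee", "maxResults": 100})
    resp.raise_for_status()
    for issue in resp.json().get("issues", []):
        fields = issue["fields"]
        assignee = fields["assignee"]["displayName"] if fields["assignee"] else "UNASSIGNED"
        print("{}\t{}\t{}".format(issue["key"], assignee, fields["summary"]))

A Jenkins cron job or the existing Beam Jira Bot could run something along these lines and mail the result to dev@, which would also cover the unassigned P1s that a per-assignee ping would miss.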
>>>>>>>>>>>>>>>> On Wed, Jul 15, 2020 at 4:23 PM Andrew Pilloud <apill...@google.com> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> We have two test suites that are responsible for a large percentage of our flaky tests and both have bugs open for about a year without being fixed. These suites are ParDoLifecycleTest (BEAM-8101 <https://issues.apache.org/jira/browse/BEAM-8101>) in Java and BigQueryWriteIntegrationTests in Python (py3 BEAM-9484 <https://issues.apache.org/jira/browse/BEAM-9484>, py2 BEAM-9232 <https://issues.apache.org/jira/browse/BEAM-9232>, old duplicate BEAM-8197 <https://issues.apache.org/jira/browse/BEAM-8197>).
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Are there any volunteers to look into these issues? What can we do to mitigate the flakiness until someone has time to investigate?
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Andrew