I like:
*Include ignored or quarantined tests in the release notes*
*Run flaky tests only in postcommit* (related? *Separate flaky tests into quarantine job*)
*Require link to Jira to rerun a test*
I am concerned about: *Add Gradle or Jenkins plugin to retry flaky tests* - because it is a convenient place for real bugs to hide.
I do not know much about: *Consider Gradle Enterprise* https://testautonation.com/analyse-test-results-deflake-flaky-tests/

Thank you for putting this list together! I believe that even if we commit to doing only some of these, we would have a much healthier project. If we can build consensus on implementing them, I will be happy to work on some of them.

On Fri, Jul 24, 2020 at 1:54 PM Kenneth Knowles <k...@apache.org> wrote:

> Adding https://testautonation.com/analyse-test-results-deflake-flaky-tests/ to the list, which seems to be a more powerful test history tool.
>
> On Fri, Jul 24, 2020 at 1:51 PM Kenneth Knowles <k...@apache.org> wrote:
>
>> Had some off-list chats to brainstorm and I wanted to bring ideas back to the dev@ list for consideration. A lot can be combined. I would really like to have a section in the release notes. I like the idea of banishing flakes from pre-commit (since you can't easily tell if it was a real failure caused by the PR) and auto-retrying in post-commit (so we can gather data on exactly what is flaking without a lot of manual investigation).
>>
>> *Include ignored or quarantined tests in the release notes*
>> Pro:
>> - Users are aware of what is not being tested and so may be silently broken
>> - It forces discussion of ignored tests to be part of our community processes
>> Con:
>> - It may look bad if the list is large (this is actually also a Pro, because if it looks bad, it is bad)
>>
>> *Run flaky tests only in postcommit*
>> Pro:
>> - isolates the bad signal so pre-commit is not affected
>> - saves pointless re-runs in pre-commit
>> - keeps a signal in post-commit that we can watch, instead of losing it completely when we disable a test
>> - maybe keeps the flaky tests in a job related to what they are testing
>> Con:
>> - we have to really watch post-commit or flakes can turn into failures
>>
>> *Separate flaky tests into quarantine job*
>> Pro:
>> - gain signal for healthy tests, as with disabling or running in post-commit
>> - also saves pointless re-runs
>> Con:
>> - may collect bad tests that we never look at, so it becomes the same as disabling the tests
>> - lots of unrelated tests grouped into one signal instead of focused on the health of a particular component
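For concreteness, the postcommit/quarantine split described above can be as simple as a per-test tag plus per-job test selection. A minimal sketch for the Python side, assuming pytest and an illustrative "quarantine" marker (not an existing Beam convention); the Java side would need an equivalent JUnit category or test filter:

# conftest.py -- register the (hypothetical) quarantine marker so pytest knows about it.
def pytest_configure(config):
    config.addinivalue_line(
        "markers",
        "quarantine: known-flaky test, excluded from precommit and run in a separate job")


# test_example.py -- tag the flaky test and point at its tracking issue.
import pytest


@pytest.mark.quarantine  # tracked in BEAM-XXXX (placeholder issue id)
def test_sometimes_flaky():
    assert 1 + 1 == 2  # stand-in for the real, intermittently failing assertion

Precommit would then run "pytest -m 'not quarantine'" while a dedicated quarantine or postcommit job runs "pytest -m quarantine", so the healthy suite keeps a clean signal and the flaky set still produces data.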
>> *Add Gradle or Jenkins plugin to retry flaky tests*
>> https://blog.gradle.org/gradle-flaky-test-retry-plugin
>> https://plugins.jenkins.io/flaky-test-handler/
>> Pro:
>> - easier than Jiras with a human pasting links; works with moving flakes to post-commit
>> - get a somewhat automated view of flakiness, whether in pre-commit or post-commit
>> - don't get stopped by flakiness
>> Con:
>> - maybe too easy to ignore flakes; we should add all flakes (not just disabled or quarantined) to the release notes
>> - sometimes flakes are actual bugs (like concurrency) so treating this as OK is not desirable
>> - without Jiras, no automated release notes
>> - Jenkins: retry will only work at the job level because it needs Maven to retry only the failed tests (I think)
>> - Jenkins: some of our jobs may have duplicate test names (but that might already be fixed)
>>
>> *Consider Gradle Enterprise*
>> Pro:
>> - get Gradle scan granularity of flake data (and other stuff)
>> - also gives module-level health, which we do not have today
>> Con:
>> - cost and administrative burden unknown
>> - we probably have to do some small work to make our jobs compatible with their history tracking
>>
>> *Require link to Jira to rerun a test*
>> Instead of saying "Run Java PreCommit" you have to link to the bug relating to the failure.
>> Pro:
>> - forces investigation
>> - helps others find out about issues
>> Con:
>> - adds a lot of manual work, or requires automation (which will probably be ad hoc and fragile)
>>
>> Kenn
>>
>> On Mon, Jul 20, 2020 at 11:59 AM Brian Hulette <bhule...@google.com> wrote:
>>
>>> > I think we are missing a way for checking that we are making progress on P1 issues. For example, P0 issues block releases and this obviously results in fixing/triaging/addressing P0 issues at least every 6 weeks. We do not have a similar process for flaky tests. I do not know what would be a good policy. One suggestion is to ping (email/slack) assignees of issues. I recently missed a flaky issue that was assigned to me. A ping like that would have reminded me. And if an assignee cannot help/does not have the time, we can try to find a new assignee.
>>>
>>> Yeah, I think this is something we should address. With the new Jira automation, at least assignees should get an email notification after 30 days because of a Jira comment like [1], but that's too long to let a test continue to flake. Could Beam Jira Bot ping every N days for P1s that aren't making progress?
>>>
>>> That wouldn't help us with P1s that have no assignee, or are assigned to overloaded people. It seems we'd need some kind of dashboard or report to capture those.
>>>
>>> [1] https://issues.apache.org/jira/browse/BEAM-8101?focusedCommentId=17121918&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17121918
>>>
>>> On Fri, Jul 17, 2020 at 1:09 PM Ahmet Altay <al...@google.com> wrote:
>>>
>>>> Another idea: could we replace our "Retest X" phrases with "Retest X (Reason)" phrases? With this change a PR author will have to look at the failed test logs. They could catch new flakiness introduced by their PR, file a JIRA for flakiness that was not noted before, or ping an existing JIRA issue/raise its severity. On the downside, this will require PR authors to do more.
>>>>
>>>> On Fri, Jul 17, 2020 at 6:46 AM Tyson Hamilton <tyso...@google.com> wrote:
>>>>
>>>>> Adding retries can be beneficial in two ways: unblocking a PR, and collecting metrics about the flakes.
>>>>
>>>> Makes sense. I think we will still need to have a plan to remove retries, similar to re-enabling disabled tests.
>>>>
>>>>> If we also had a flaky test leaderboard that showed which tests are the most flaky, then we could take action on them. Encouraging someone from the community to fix the flaky test is another issue.
>>>>>
>>>>> The test status matrix on the GitHub landing page could show the flake level to communicate to users which modules are losing a trustable test signal. Maybe this shows up as a flake % or a code coverage % that decreases due to disabled flaky tests.
>>>>
>>>> +1 to a dashboard that will show a "leaderboard" of flaky tests.
>>>>
>>>>> I didn't look for plugins, just dreaming up some options.
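One low-tech version of that leaderboard would fold the JUnit XML reports Jenkins already archives into per-test failure counts; a rough sketch, assuming the report files have been copied locally (the glob pattern is an assumed layout, not a Beam convention):

import glob
import xml.etree.ElementTree as ET
from collections import Counter


def flake_leaderboard(report_glob="reports/**/TEST-*.xml", top_n=20):
    """Count failed/errored testcases across many JUnit XML reports and rank them."""
    failures = Counter()
    for path in glob.glob(report_glob, recursive=True):
        for case in ET.parse(path).getroot().iter("testcase"):
            name = "{}.{}".format(case.get("classname"), case.get("name"))
            if case.find("failure") is not None or case.find("error") is not None:
                failures[name] += 1
    for name, count in failures.most_common(top_n):
        print("{:4d}  {}".format(count, name))

Tests that fail often across unrelated PRs and runs float to the top, which is a reasonable first approximation of "most flaky".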
>>>>> On Thu, Jul 16, 2020, 5:58 PM Luke Cwik <lc...@google.com> wrote:
>>>>>
>>>>>> What do other Apache projects do to address this issue?
>>>>>>
>>>>>> On Thu, Jul 16, 2020 at 5:51 PM Ahmet Altay <al...@google.com> wrote:
>>>>>>
>>>>>>> I agree with the comments in this thread.
>>>>>>> - If we are not re-enabling disabled tests, or do not have a plan to re-enable them, disabling tests only provides us temporary relief until users eventually find the issues instead of the disabled tests.
>>>>>>> - I feel similarly about retries. It is reasonable to add retries for reasons we understand. Adding retries just to avoid flakes is similar to disabling tests. They might hide real issues.
>>>>>>>
>>>>>>> I think we are missing a way for checking that we are making progress on P1 issues. For example, P0 issues block releases and this obviously results in fixing/triaging/addressing P0 issues at least every 6 weeks. We do not have a similar process for flaky tests. I do not know what would be a good policy. One suggestion is to ping (email/slack) assignees of issues. I recently missed a flaky issue that was assigned to me. A ping like that would have reminded me. And if an assignee cannot help/does not have the time, we can try to find a new assignee.
>>>>>>>
>>>>>>> Ahmet
>>>>>>>
>>>>>>> On Thu, Jul 16, 2020 at 11:52 AM Valentyn Tymofieiev <valen...@google.com> wrote:
>>>>>>>
>>>>>>>> I think the original discussion[1] on introducing tenacity might answer that question.
>>>>>>>>
>>>>>>>> [1] https://lists.apache.org/thread.html/16060fd7f4d408857a5e4a2598cc96ebac0f744b65bf4699001350af%40%3Cdev.beam.apache.org%3E
>>>>>>>>
>>>>>>>> On Thu, Jul 16, 2020 at 10:48 AM Rui Wang <ruw...@google.com> wrote:
>>>>>>>>
>>>>>>>>> Is there an observation that enabling tenacity improves the development experience on the Python SDK? E.g. less wait time to get a PR passing and merged? Or it might be a matter of choosing the right number of retries to align with the "flakiness" of a test?
>>>>>>>>>
>>>>>>>>> -Rui
>>>>>>>>>
>>>>>>>>> On Thu, Jul 16, 2020 at 10:38 AM Valentyn Tymofieiev <valen...@google.com> wrote:
>>>>>>>>>
>>>>>>>>>> We used tenacity[1] to retry some unit tests for which we understood the nature of the flakiness.
>>>>>>>>>>
>>>>>>>>>> [1] https://github.com/apache/beam/blob/3b9aae2bcaeb48ab43a77368ae496edc73634c91/sdks/python/apache_beam/runners/portability/fn_api_runner/fn_runner_test.py#L1156
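For reference, that tenacity usage amounts to a retry decorator on the test method; a minimal self-contained sketch (the retry parameters and the simulated flake are illustrative, not the exact code in fn_runner_test.py):

import random
import unittest

from tenacity import retry, stop_after_attempt


class RetryExampleTest(unittest.TestCase):

    # Retry up to 3 attempts for a flake whose cause we understand;
    # reraise=True surfaces the original failure if every attempt fails.
    @retry(reraise=True, stop=stop_after_attempt(3))
    def test_with_understood_flakiness(self):
        # Stand-in for a call that intermittently fails for a known reason.
        self.assertLess(random.random(), 0.9)


if __name__ == "__main__":
    unittest.main()

The important part is that the retry is scoped to one test with a known failure mode, and is easy to grep for and remove once the underlying issue is fixed.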
>>>>>>>>>> On Thu, Jul 16, 2020 at 10:25 AM Kenneth Knowles <k...@apache.org> wrote:
>>>>>>>>>>
>>>>>>>>>>> Didn't we use something like that flaky retry plugin for Python tests at some point? Adding retries may be preferable to disabling the test. We need a process to remove the retries ASAP though. As Luke says, that is not so easy to make happen. Having a way to make P1 bugs more visible in an ongoing way may help.
>>>>>>>>>>>
>>>>>>>>>>> Kenn
>>>>>>>>>>>
>>>>>>>>>>> On Thu, Jul 16, 2020 at 8:57 AM Luke Cwik <lc...@google.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> I don't think I have seen tests that were previously disabled become re-enabled.
>>>>>>>>>>>>
>>>>>>>>>>>> It seems as though we have about ~60 disabled tests in Java and ~15 in Python. Half of the Java ones seem to be in ZetaSQL/SQL due to missing features, so they are unrelated to flakiness.
>>>>>>>>>>>>
>>>>>>>>>>>> On Thu, Jul 16, 2020 at 8:49 AM Gleb Kanterov <g...@spotify.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> There is something called test-retry-gradle-plugin [1]. It retries tests if they fail, and it has different modes for handling flaky tests. Did we ever try or consider using it?
>>>>>>>>>>>>>
>>>>>>>>>>>>> [1]: https://github.com/gradle/test-retry-gradle-plugin
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Thu, Jul 16, 2020 at 1:15 PM Gleb Kanterov <g...@spotify.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> I agree with what Ahmet is saying. I can share my perspective: recently I had to retrigger a build 6 times due to flaky tests, and each retrigger took one hour of waiting time.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I've seen examples of automatic tracking of flaky tests, where a test is considered flaky if it both fails and succeeds for the same git SHA. I'm not sure if there is anything we can enable to get this automatically.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> /Gleb
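The detection rule Gleb describes is straightforward to express once results are exported somewhere queryable: group runs by (git SHA, test) and flag any pair that has both a pass and a fail. A small sketch over an assumed export of (sha, test, passed) records:

from collections import defaultdict


def find_flaky_tests(results):
    """results: iterable of (git_sha, test_name, passed) tuples exported from CI runs."""
    outcomes = defaultdict(set)
    for sha, test, passed in results:
        outcomes[(sha, test)].add(passed)
    # Flaky: the same code (same SHA) produced both a pass and a fail for the test.
    return sorted({test for (_, test), seen in outcomes.items() if seen == {True, False}})


# Example: test_b both failed and passed at SHA "abc123", so it is reported as flaky.
print(find_flaky_tests([
    ("abc123", "test_a", True),
    ("abc123", "test_b", False),
    ("abc123", "test_b", True),
]))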
>>>>>>>>>>>>>> On Thu, Jul 16, 2020 at 2:33 AM Ahmet Altay <al...@google.com> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I think it would be reasonable to disable/sickbay any flaky test that is actively blocking people. The collective cost of flaky tests for such a large group of contributors is very significant.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Most of these issues are unassigned. IMO, it makes sense to assign these issues to the most relevant person (who added the test/who generally maintains those components). Those people can either fix and re-enable the tests, or remove them if they no longer provide valuable signals.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Ahmet
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Wed, Jul 15, 2020 at 4:55 PM Kenneth Knowles <k...@apache.org> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> The situation is much worse than that IMO. My experience of the last few days is that a large portion of time went to *just connecting failing runs with the corresponding Jira tickets or filing new ones*.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Summarized on PRs:
>>>>>>>>>>>>>>>> - https://github.com/apache/beam/pull/12272#issuecomment-659050891
>>>>>>>>>>>>>>>> - https://github.com/apache/beam/pull/12273#issuecomment-659070317
>>>>>>>>>>>>>>>> - https://github.com/apache/beam/pull/12225#issuecomment-656973073
>>>>>>>>>>>>>>>> - https://github.com/apache/beam/pull/12225#issuecomment-657743373
>>>>>>>>>>>>>>>> - https://github.com/apache/beam/pull/12224#issuecomment-657744481
>>>>>>>>>>>>>>>> - https://github.com/apache/beam/pull/12216#issuecomment-657735289
>>>>>>>>>>>>>>>> - https://github.com/apache/beam/pull/12216#issuecomment-657780781
>>>>>>>>>>>>>>>> - https://github.com/apache/beam/pull/12216#issuecomment-657799415
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> The tickets:
>>>>>>>>>>>>>>>> - https://issues.apache.org/jira/browse/BEAM-10460 SparkPortableExecutionTest
>>>>>>>>>>>>>>>> - https://issues.apache.org/jira/browse/BEAM-10471 CassandraIOTest > testEstimatedSizeBytes
>>>>>>>>>>>>>>>> - https://issues.apache.org/jira/browse/BEAM-10504 ElasticSearchIOTest > testWriteFullAddressing and testWriteWithIndexFn
>>>>>>>>>>>>>>>> - https://issues.apache.org/jira/browse/BEAM-10470 JdbcDriverTest
>>>>>>>>>>>>>>>> - https://issues.apache.org/jira/browse/BEAM-8025 CassandraIOTest > @BeforeClass (classmethod)
>>>>>>>>>>>>>>>> - https://issues.apache.org/jira/browse/BEAM-8454 FnHarnessTest
>>>>>>>>>>>>>>>> - https://issues.apache.org/jira/browse/BEAM-10506 SplunkEventWriterTest
>>>>>>>>>>>>>>>> - https://issues.apache.org/jira/browse/BEAM-10472 direct runner ParDoLifecycleTest
>>>>>>>>>>>>>>>> - https://issues.apache.org/jira/browse/BEAM-9187 DefaultJobBundleFactoryTest
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Here are our P1 test flake bugs:
>>>>>>>>>>>>>>>> https://issues.apache.org/jira/issues/?jql=project%20%3D%20BEAM%20AND%20status%20in%20(Open%2C%20%22In%20Progress%22)%20AND%20resolution%20%3D%20Unresolved%20AND%20labels%20%3D%20flake%20ORDER%20BY%20priority%20DESC%2C%20updated%20DESC
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> It seems quite a few of them are actively hindering people right now.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Kenn
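Brian's "ping every N days" idea and the P1 flake query above could be combined into a small scheduled report rather than ad hoc clicking; a sketch against the standard Jira REST search API, assuming anonymous read access to issues.apache.org (auth, error handling, and the actual notification mechanism are left out):

import requests

JQL = ('project = BEAM AND status in (Open, "In Progress") '
       'AND resolution = Unresolved AND labels = flake '
       'ORDER BY priority DESC, updated DESC')


def stale_flake_report(base_url="https://issues.apache.org/jira"):
    """Print open flake issues with their assignee (or UNASSIGNED) for a periodic nag."""
    resp = requests.get(
        "{}/rest/api/2/search".format(base_url),
        params={"jql": JQL, "fields": "summary,assignee", "maxResults": 100})
    resp.raise_for_status()
    for issue in resp.json().get("issues", []):
        fields = issue["fields"]
        assignee = fields["assignee"]["displayName"] if fields["assignee"] else "UNASSIGNED"
        print("{}\t{}\t{}".format(issue["key"], assignee, fields["summary"]))

A Jenkins cron job or the existing Beam Jira Bot could run something along these lines and mail the result to dev@, which would also cover the unassigned P1s that a per-assignee ping would miss.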
>>>>>>>>>>>>>>>> On Wed, Jul 15, 2020 at 4:23 PM Andrew Pilloud <apill...@google.com> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> We have two test suites that are responsible for a large percentage of our flaky tests and both have bugs open for about a year without being fixed. These suites are ParDoLifecycleTest (BEAM-8101 <https://issues.apache.org/jira/browse/BEAM-8101>) in Java and BigQueryWriteIntegrationTests in Python (py3 BEAM-9484 <https://issues.apache.org/jira/browse/BEAM-9484>, py2 BEAM-9232 <https://issues.apache.org/jira/browse/BEAM-9232>, old duplicate BEAM-8197 <https://issues.apache.org/jira/browse/BEAM-8197>).
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Are there any volunteers to look into these issues? What can we do to mitigate the flakiness until someone has time to investigate?
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Andrew