Hello,

I would like to present the results of the July fixing of flaky tests.


short version:

19 cases of platform-specific flakiness to start with (from 14 different tests)


as a result:

11 cases of platform-specific flakiness were fixed (caused by 8 different tests)

4 cases of platform-specific flakiness remained flaky (from 4 different tests)

4 cases of flakiness were blacklisted (from 2 different tests)


The table at the link below shows more detailed information about the fixed 
tests.

https://wiki.qt.io/Fixed_flaky_tests_in_July_2022


long version:

How was the problem approached?

We collected data about flakiness in June; in July we created a list of the top 
"worst" cases that failed integrations and contacted module maintainers. We 
then gave the fixes some time to be merged and to run a sufficient number of 
times to gain confidence that they actually worked - and in late August we 
checked the results again.


The complete list of flaky tests from June that were fixed in July can be 
found at this link:

https://testresults.qt.io/grafana/d/000000007/flaky-summary-ci-test-info?orgId=1&from=1656626400000&to=1659304799000&viewPanel=65


Which tests were taken into analysis?

Tests from the dev branch that negatively impacted the integration system by 
causing at least 1 failure in any integration and at least 1 flaky event.
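As a rough sketch of the selection criterion (the actual selection lives in the notebooks linked at the end; the record format and field names here are purely illustrative, not the real schema):

```python
# Hypothetical per-test summary records; field names are made up for
# illustration and do not reflect the actual data schema.
records = [
    {"test": "tst_a::f1", "branch": "dev", "failures": 3, "flaky_events": 5},
    {"test": "tst_b::f2", "branch": "dev", "failures": 0, "flaky_events": 2},
    {"test": "tst_c::f3", "branch": "6.4", "failures": 4, "flaky_events": 1},
]

def taken_into_analysis(rec):
    # dev branch only, at least 1 failed integration and 1 flaky event
    return (rec["branch"] == "dev"
            and rec["failures"] >= 1
            and rec["flaky_events"] >= 1)

selected = [r["test"] for r in records if taken_into_analysis(r)]
print(selected)  # only tst_a::f1 meets all three conditions
```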


What is the difference between a failed and a flaky test?

You can find a good explanation here:

https://testresults.qt.io/grafana/d/000000007/flaky-summary-ci-test-info?orgId=1&viewPanel=55

and here:
https://testresults.qt.io/grafana/d/000000007/flaky-summary-ci-test-info?orgId=1&viewPanel=41


What is understood by a "test"?

A test is an umbrella term for a pair: test case and test function. A test case 
(usually a cpp file) contains several test functions that return results (pass, 
fail, or xfail). We collected and analyzed these results. Additionally, some 
tests contain data tags - test function arguments that provide even more 
detailed results - however, we do not store them; the granularity of the data 
ends at the test function level.
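The aggregation described above can be sketched as follows (file, function, and tag names are hypothetical; this is a simplified illustration of how data-tag results fold into a single per-test-function result, not the actual pipeline code):

```python
# Raw results arrive per data tag; values here are made up for illustration.
raw = [
    ("tst_example.cpp", "stress_test", "tag:threads=2", "pass"),
    ("tst_example.cpp", "stress_test", "tag:threads=8", "fail"),
    ("tst_example.cpp", "basic_test",  "",              "pass"),
]

# Granularity ends at the test-function level: fold data-tag results
# into one result per (test case, test function) pair.
results = {}
for case, func, _tag, outcome in raw:
    key = (case, func)
    # once a data tag has failed, the whole test function counts as failed
    if results.get(key) == "fail":
        continue
    results[key] = outcome

print(results)
```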


What is understood by a "platform-specific flakiness"?

A test runs on a specific platform - we describe it as "target operating 
system" and "target architecture". In most cases, flakiness is related to a 
particular test run on a specific platform.

E.g., the test case "tst_qmutex" with the test function "more stress" can 
return stable results on most platforms but be flaky on MacOS_11 X86_64 or on 
Windows_10_21H2 X86_64. In such a case, it will be counted as 2 cases of 
"platform-specific flakiness" (MacOS_11 and Windows_10_21H2) caused by a 
single (unique, distinct) test.
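The counting rule from the example above can be sketched like this (the data format is illustrative, not the real one; only the tst_qmutex values come from the text):

```python
# Flaky observations as (test, target OS, target architecture) triples;
# the format is hypothetical, the values follow the tst_qmutex example.
flaky_runs = [
    ("tst_qmutex::more_stress", "MacOS_11",        "X86_64"),
    ("tst_qmutex::more_stress", "Windows_10_21H2", "X86_64"),
    ("tst_qmutex::more_stress", "MacOS_11",        "X86_64"),  # repeat: counted once
]

# One "platform-specific flakiness" = one distinct (test, OS, arch) triple,
# while the same test is counted only once as a unique/distinct test.
platform_specific = set(flaky_runs)
unique_tests = {test for test, _os, _arch in flaky_runs}

print(len(platform_specific), len(unique_tests))  # 2 cases, 1 distinct test
```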


Since the July fixing provided good results, in August we repeated the 
procedure: we gathered data about the most damaging (integration-failing) 
flaky tests and compared it to July's, to make sure only "new" tests are on 
the list. August's failing flakiness can be viewed at the link below. 
Developers and maintainers are welcome to check whether their tests are on 
the list.

https://wiki.qt.io/Flaky_tests_that_caused_failures_in_August


Big thanks to everyone participating in fixing the tests!


Anna Wojciechowska


The notebooks used to prepare this analysis can be found at:

https://git.qt.io/qtqa/notebooks/-/tree/main/flakiness/august_2022

_______________________________________________
Development mailing list
[email protected]
https://lists.qt-project.org/listinfo/development
