Hi MXNet community,

Thanks to the efforts of several community members, we have identified many flaky tests. These tests are currently disabled to keep continuous integration (CI) running smoothly, but as a result we have lost coverage of the features they exercise. They need to be fixed and re-enabled to ensure the quality of our releases. I'd like to propose the following:
1. Re-enable flaky Python tests with retries where feasible

Although these tests are unstable, they can still catch breaking changes. For example, if a test fails randomly with 10% probability, the probability that all three retries fail is 0.1^3 = 0.1%, whereas a genuine breaking change fails every time. Retries do increase testing time, but that is a compromise worth making to avoid bigger problems. (A sketch of what such a retry wrapper could look like is in the P.S. below.)

2. Set a standard for new tests

Having criteria that new tests must follow would improve not only the quality of the tests but also the quality of the code. I propose the following standard for tests:

- Passes reliably with good coverage
- Avoids randomness unless necessary (see the second sketch in the P.S. for keeping necessary randomness reproducible)
- Avoids external dependencies unless necessary (e.g. due to licensing)
- Is not resource-intensive unless necessary (e.g. scaling tests)

In addition, I'd like to call for volunteers to help fix these tests. New members are especially welcome, as this is a good opportunity to become familiar with MXNet. I'd also like to ask the members who wrote the affected features/tests to help, either by fixing them or by helping others understand the issues.

The effort to fix the tests is tracked at: https://github.com/apache/incubator-mxnet/issues/9412

Best regards,
Sheng
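
P.S. To make point 1 concrete, here is a minimal sketch of what a retry wrapper could look like. The decorator, the retry count, and the example test are illustrative only, not an existing MXNet utility:

import functools
import random

def retry(n=3):
    """Re-run a flaky test up to n attempts; fail only if all attempts fail.

    If a test fails independently with probability p, the chance that all
    n attempts fail is p**n (e.g. 0.1**3 = 0.001, i.e. 0.1%), while a real
    breaking change still fails on every attempt.
    """
    def decorator(test_fn):
        @functools.wraps(test_fn)
        def wrapper(*args, **kwargs):
            for attempt in range(n):
                try:
                    return test_fn(*args, **kwargs)
                except AssertionError:
                    if attempt == n - 1:
                        raise  # every attempt failed: report a real failure
        return wrapper
    return decorator

@retry(n=3)
def test_flaky_example():
    # Stand-in for a numerical check that fails ~10% of the time.
    assert random.random() > 0.1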
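For the "avoids randomness unless necessary" criterion in point 2, tests that genuinely need randomness can still be made reproducible by controlling the seed and printing it, so that a CI failure can be replayed. A sketch, assuming MXNet is installed; the helper name and the MXNET_TEST_SEED variable are illustrative:

import os
import random
import numpy as np
import mxnet as mx

def setup_test_seed():
    # To replay a CI failure, set MXNET_TEST_SEED to the seed printed
    # in the failing log; otherwise a fresh seed is drawn.
    seed = int(os.getenv('MXNET_TEST_SEED', random.randint(0, 2**31 - 1)))
    print('Using seed %d; set MXNET_TEST_SEED=%d to reproduce.' % (seed, seed))
    random.seed(seed)
    np.random.seed(seed)
    mx.random.seed(seed)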