Hi MXNet community,

Thanks to the efforts of several community members, we have identified many
flaky tests. These tests are currently disabled to ensure the smooth
execution of continuous integration (CI), but as a result we have lost
coverage of those features. They need to be fixed and re-enabled to ensure
the quality of our releases. I'd like to propose the following:

1, Re-enable flaky Python tests with retries, if feasible
Although the tests are unstable, they can still catch breaking changes. For
example, if a test fails randomly with 10% probability, the probability that
all three retries fail is only 0.1% (0.1^3). A breaking change, on the other
hand, would result in 100% failure. Although retries could increase the
testing time, it's a compromise that can help avoid bigger problems.
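For illustration, here is a minimal sketch of what such a retry wrapper
could look like (the decorator below is hypothetical, not an existing MXNet
test utility; the retry count and the exception it catches are assumptions):

    import functools
    import time

    def retry(n=3):
        # Hypothetical decorator: re-run a flaky test up to n times and
        # pass if any attempt succeeds.
        def decorator(test_fn):
            @functools.wraps(test_fn)
            def wrapper(*args, **kwargs):
                for attempt in range(1, n + 1):
                    try:
                        return test_fn(*args, **kwargs)
                    except AssertionError:
                        if attempt == n:
                            raise  # every attempt failed: likely a real breakage
                        time.sleep(1)  # brief pause before retrying
            return wrapper
        return decorator

    @retry(n=3)
    def test_example():
        # a test that fails ~10% of the time would fail all three
        # attempts with probability roughly 0.1%
        assert True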

2, Set standards for new tests
I think having criteria that new tests should follow can improve not only
the quality of tests, but also the quality of code. I propose the following
standards for new tests (a short example follows the list):
- Pass reliably with good coverage
- Avoid randomness unless necessary
- Avoid external dependencies unless necessary (e.g. due to licensing)
- Avoid being resource-intensive unless necessary (e.g. scaling tests)
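As an example, here is a minimal sketch of a test that follows these
criteria (the test name is hypothetical; it fixes the random seed so any
randomness is reproducible, uses small inputs, and needs no external data):

    import numpy as np
    import mxnet as mx

    def test_elementwise_add_small():
        np.random.seed(42)  # fixed seed: failures are reproducible
        a = np.random.rand(3, 4).astype('float32')  # small inputs, not resource-intensive
        b = np.random.rand(3, 4).astype('float32')
        out = (mx.nd.array(a) + mx.nd.array(b)).asnumpy()
        # no external downloads or datasets; compare against a numpy reference
        np.testing.assert_allclose(out, a + b, rtol=1e-5, atol=1e-7)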

In addition, I'd like to call for volunteers to help fix these tests. New
members are especially welcome, as it's a good opportunity to become
familiar with MXNet. I'd also like to ask members who wrote the features or
tests to help, either by fixing them or by helping others understand the
issues.

The effort to fix the tests is tracked at:
https://github.com/apache/incubator-mxnet/issues/9412

Best regards,
Sheng
