Today I finished merging a patch series that switches on dist-test
when running tests that power the flaky test dashboard (previously
they were run on a CentOS 6.6 VM). The goal wasn't to make these tests
finish faster; it's that an increasing number of tests are flaky in one
environment but not in the other, which has led to all kinds of
frustration, such as:
- The dashboard indicated that a test was flaky, but it was impossible
to repro the failure when looping the test in dist-test.
- Some tests were flaky in precommit but weren't recorded as flaky in
the dashboard.

Unifying the two environments should address this. As new data enters
the dashboard over the next week, we should see some flakes drop off
and new ones appear, and precommit should stop being flaky too (with
the exception of tests that actually fail even when retried 3 times).

It's worth noting that this transition means we lose some CentOS 6.6
test coverage. That's unfortunate; it is a runtime environment we tend
to care about. For now I don't have a great answer except that we
should be more diligent about testing CentOS 6.6 in ad hoc ways,
especially during a Kudu release. The flaky test dashboard does
support arbitrary tagging so we could run those tests in multiple
environments and differentiate between them in the dashboard, though
we'd probably need to change how we retrieve the list of known flakes
to avoid "crossing the streams". For now I recommend we wait and see
what the world looks like after this transition, and reevaluate if
we're not happy with the result.
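
To make the "crossing the streams" concern concrete, here is a rough
sketch of what a per-environment lookup of known flakes could look
like. Everything in it is an assumption made for illustration: the
record shape, the tag strings, the test names, and the known_flakes
helper are hypothetical and do not reflect the dashboard's actual
schema or API. "Flaky" here simply means the test both passed and
failed at least once in the given environment.

```python
from collections import defaultdict

def known_flakes(records, environment):
    """Return the tests that look flaky in one environment only.

    records: iterable of (test_name, environment_tag, passed) tuples.
    This shape is hypothetical, not the dashboard's real schema.
    """
    runs = defaultdict(int)
    failures = defaultdict(int)
    for test_name, tag, passed in records:
        # Ignore results from other environments so that a flake seen
        # on CentOS 6.6 is never reported as a dist-test flake.
        if tag != environment:
            continue
        runs[test_name] += 1
        if not passed:
            failures[test_name] += 1
    # Flaky: at least one failure and at least one pass.
    return {name for name, n in runs.items() if 0 < failures[name] < n}

# Made-up sample data: test-a is flaky only in dist-test,
# test-b is flaky only on CentOS 6.6.
records = [
    ("test-a", "dist-test", True),
    ("test-a", "dist-test", False),
    ("test-a", "centos6.6", True),
    ("test-b", "centos6.6", False),
    ("test-b", "centos6.6", True),
    ("test-b", "dist-test", True),
]

dist_test_flakes = known_flakes(records, "dist-test")
centos_flakes = known_flakes(records, "centos6.6")
```

The point of keying the lookup on the tag is that precommit would
consult only the list that matches the environment it runs in.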
