Hello Todd Lipcon, I'd like you to do a code review. Please visit
http://gerrit.cloudera.org:8080/10069 to review the following change. Change subject: test_util: include shard number in test scratch directory name ...................................................................... test_util: include shard number in test scratch directory name Amongst other things, build-and-test.sh checks that every test cleans up its scratch directory. It only performs this check if all tests pass, because if a test fails, it'll leave behind its scratch directory for inspection. This gets a little murky when considering flaky tests, because if one fails, it is retried a few times. As such, it's important that when run-test.sh prepares to rerun a flaky test, it cleans up the scratch directory left behind by the failed run. Unfortunately, run-test.sh isn't really aware of what it needs to clean up, because each test in the test executable creates its own test-specific scratch directory within a common root. To account for this, run-test.sh figures out what to clean using the following algorithm: 1. Before the test, capture the list of all directories within TEST_TMPDIR. 2. After the test, repeat the capture. 3. Diff the two captures to extract a list of directories that exist after but not before. 4. Search for the test name as a prefix in the extracted list. While complex, the algorithm does match and delete all scratch directories that were left behind by a particular test executable, even if TEST_TMPDIR is in use by other tests running in parallel. However, sharded tests break the algorithm, because the test name now includes a shard index, and that index isn't included in the scratch directory's name. I experimented with various fixes that included giving run-test.sh more ownership over TEST_TMPDIR, but eventually settled on a smaller and more targeted fix: if the test is being sharded, include the shard index in the scratch directory name. I tested this patch by running: TEST_TMPDIR=/tmp/kudutest-1000 \ KUDU_FLAKY_TEST_ATTEMPTS=3 \ KUDU_FLAKY_TEST_LIST=<(echo raft_consensus-itest) \ ctest -j8 -R raft_consensus-itest In the background, I also induced stress on the machine. This caused RaftConsensusParamReplicationModesITest.Test_KUDU_1735/1 to flake and fail from time to time. Without the patch, the flaky test's scratch directory was always left behind in TEST_TMPDIR. With it, the directory was always gone, unless the test failed every time it was run, which is expected behavior. Change-Id: Ie5686eb7c3a011a94b6c8c4eadbedb23acbf6e60 --- M src/kudu/util/test_util.cc 1 file changed, 10 insertions(+), 1 deletion(-) git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/69/10069/1 -- To view, visit http://gerrit.cloudera.org:8080/10069 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: kudu Gerrit-Branch: master Gerrit-MessageType: newchange Gerrit-Change-Id: Ie5686eb7c3a011a94b6c8c4eadbedb23acbf6e60 Gerrit-Change-Number: 10069 Gerrit-PatchSet: 1 Gerrit-Owner: Adar Dembo <a...@cloudera.com> Gerrit-Reviewer: Todd Lipcon <t...@apache.org>