Hello Todd Lipcon,

I'd like you to do a code review. Please visit


to review the following change.

Change subject: test_util: include shard number in test scratch directory name

test_util: include shard number in test scratch directory name

Amongst other things, build-and-test.sh checks that every test cleans up its
scratch directory. It only performs this check if all tests pass, because if
a test fails, it'll leave behind its scratch directory for inspection.

This gets a little murky when considering flaky tests, because if one fails,
it is retried a few times. As such, it's important that when run-test.sh
prepares to rerun a flaky test, it cleans up the scratch directory left
behind by the failed run. Unfortunately, run-test.sh isn't really aware of
what it needs to clean up, because each test in the test executable creates
its own test-specific scratch directory within a common root. To account for
this, run-test.sh figures out what to clean using the following algorithm:
1. Before the test, capture the list of all directories within TEST_TMPDIR.
2. After the test, repeat the capture.
3. Diff the two captures to extract a list of directories that exist after
   but not before.
4. Search for the test name as a prefix in the extracted list.

While complex, the algorithm does match and delete all scratch directories
that were left behind by a particular test executable, even if TEST_TMPDIR
is in use by other tests running in parallel. However, sharded tests break
the algorithm, because the test name now includes a shard index, and that
index isn't included in the scratch directory's name.

I experimented with various fixes that included giving run-test.sh more
ownership over TEST_TMPDIR, but eventually settled on a smaller and more
targeted fix: if the test is being sharded, include the shard index in the
scratch directory name.

I tested this patch by running:

  TEST_TMPDIR=/tmp/kudutest-1000 \
  KUDU_FLAKY_TEST_LIST=<(echo raft_consensus-itest) \
    ctest -j8 -R raft_consensus-itest

In the background, I also induced stress on the machine. This caused
RaftConsensusParamReplicationModesITest.Test_KUDU_1735/1 to flake and fail
from time to time.

Without the patch, the flaky test's scratch directory was always left behind
in TEST_TMPDIR. With it, the directory was always gone, unless the test
failed every time it was run, which is expected behavior.

Change-Id: Ie5686eb7c3a011a94b6c8c4eadbedb23acbf6e60
M src/kudu/util/test_util.cc
1 file changed, 10 insertions(+), 1 deletion(-)

  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/69/10069/1
To view, visit http://gerrit.cloudera.org:8080/10069
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: newchange
Gerrit-Change-Id: Ie5686eb7c3a011a94b6c8c4eadbedb23acbf6e60
Gerrit-Change-Number: 10069
Gerrit-PatchSet: 1
Gerrit-Owner: Adar Dembo <a...@cloudera.com>
Gerrit-Reviewer: Todd Lipcon <t...@apache.org>

Reply via email to