delete_table-test: fix flakiness with table creation timeout

This test was timing out frequently when trying to create a
replication-2 table on a cluster with 3 tservers, one of which was
recently shut down. The master could try to place a replica on the
non-running server, which would then take some time to time out and
try a new placement.
The workaround here is to restart the master so it no longer sees the
crashed server as a valid placement option.

Change-Id: Ic61ad384e1b247f83bfc709528c4c7bda586c9d2
Reviewed-on: http://gerrit.cloudera.org:8080/4632
Reviewed-by: David Ribeiro Alves <[email protected]>
Reviewed-by: Dinesh Bhat <[email protected]>
Tested-by: Kudu Jenkins


Project: http://git-wip-us.apache.org/repos/asf/kudu/repo
Commit: http://git-wip-us.apache.org/repos/asf/kudu/commit/98f42cdd
Tree: http://git-wip-us.apache.org/repos/asf/kudu/tree/98f42cdd
Diff: http://git-wip-us.apache.org/repos/asf/kudu/diff/98f42cdd

Branch: refs/heads/master
Commit: 98f42cdd878caa429377625a2288d22ed0d114f2
Parents: 0f99d40
Author: Todd Lipcon <[email protected]>
Authored: Wed Oct 5 10:52:29 2016 -0700
Committer: David Ribeiro Alves <[email protected]>
Committed: Wed Oct 5 20:26:40 2016 +0000

----------------------------------------------------------------------
 src/kudu/integration-tests/delete_table-test.cc | 12 ++++++++++--
 1 file changed, 10 insertions(+), 2 deletions(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/kudu/blob/98f42cdd/src/kudu/integration-tests/delete_table-test.cc
----------------------------------------------------------------------
diff --git a/src/kudu/integration-tests/delete_table-test.cc b/src/kudu/integration-tests/delete_table-test.cc
index 6a0de2f..d331d43 100644
--- a/src/kudu/integration-tests/delete_table-test.cc
+++ b/src/kudu/integration-tests/delete_table-test.cc
@@ -432,7 +432,7 @@ TEST_F(DeleteTableTest, TestAutoTombstoneAfterCrashDuringTabletCopy) {
   ASSERT_OK(cluster_->master()->Restart());
   ASSERT_OK(cluster_->WaitForTabletServerCount(1, MonoDelta::FromSeconds(30)));
 
-  // Set up a table which has a table only on TS 0. This will be used to test for
+  // Set up a table which has a tablet only on TS 0. This will be used to test for
   // "collateral damage" bugs where incorrect handling of the main test tablet
   // accidentally removes blocks from another tablet.
   // We use a sequential workload so that we just flush and don't compact.
@@ -467,7 +467,15 @@ TEST_F(DeleteTableTest, TestAutoTombstoneAfterCrashDuringTabletCopy) {
   ASSERT_OK(cluster_->tablet_server(2)->Restart());
   cluster_->tablet_server(kTsIndex)->Shutdown();
 
-  // Create a new tablet which is replicated on the other two servers.
+  // Restart the master to be sure that it only sees the live servers.
+  // Otherwise it may try to create a tablet with a replica on the down server.
+  // The table creation would eventually succeed after picking a different set of
+  // replicas, but not before causing a timeout.
+  cluster_->master()->Shutdown();
+  ASSERT_OK(cluster_->master()->Restart());
+  ASSERT_OK(cluster_->WaitForTabletServerCount(2, MonoDelta::FromSeconds(30)));
+
+  // Create a new table with a single tablet replicated on the other two servers.
   // We use the same sequential workload. This produces block ID sequences
   // that look like:
   // TS 0: |---- blocks from 'other-table' ---]
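For reference, the operation that was hitting the timeout is a plain
replication-2 table create. A minimal sketch of the equivalent call
against the Kudu C++ client API — the table name, schema, and helper
function below are illustrative only and are not part of this patch;
the test itself drives table creation through its workload helper:

  #include <memory>
  #include <string>
  #include <vector>

  #include "kudu/client/client.h"

  using kudu::client::KuduClient;
  using kudu::client::KuduColumnSchema;
  using kudu::client::KuduSchema;
  using kudu::client::KuduSchemaBuilder;
  using kudu::client::KuduTableCreator;
  using kudu::client::sp::shared_ptr;

  // Create a replication-2 table (hypothetical example). With three
  // registered tablet servers, the master may pick any two of them for
  // the replicas; if a chosen server is actually down, the create
  // stalls until that placement times out and a different set of
  // replicas is tried — hence the flakiness this patch works around.
  kudu::Status CreateTwoReplicaTable(const shared_ptr<KuduClient>& client) {
    KuduSchemaBuilder b;
    b.AddColumn("key")->Type(KuduColumnSchema::INT32)->NotNull()->PrimaryKey();
    KuduSchema schema;
    kudu::Status s = b.Build(&schema);
    if (!s.ok()) return s;

    std::unique_ptr<KuduTableCreator> creator(client->NewTableCreator());
    return creator->table_name("example-table")
        .schema(&schema)
        .set_range_partition_columns({ "key" })
        .num_replicas(2)
        .Create();
  }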
