Adar Dembo has posted comments on this change. ( http://gerrit.cloudera.org:8080/10540 )
Change subject: [tools] Extra integration tests for the rebalancer ...................................................................... Patch Set 10: (8 comments) These new tests are pretty complex; can you make sure they're robust by looping them in dist-test, both with and without TSAN, and with stress threads? http://gerrit.cloudera.org:8080/#/c/10540/10/src/kudu/tools/kudu-admin-test.cc File src/kudu/tools/kudu-admin-test.cc: http://gerrit.cloudera.org:8080/#/c/10540/10/src/kudu/tools/kudu-admin-test.cc@1539 PS10, Line 1539: ASSERT_STR_CONTAINS(out, "rebalancing is complete: cluster is balanced") : << "stderr: " << err; No need to check this again. http://gerrit.cloudera.org:8080/#/c/10540/10/src/kudu/tools/kudu-admin-test.cc@1690 PS10, Line 1690: do_create_tables = false; : do_alter_tables = false; : do_delete_tables = false; Not sure I see the point of separating these. Wouldn't a single CountDownLatch be better? The nice thing about a latch is that the threads could sleep and check the latch at the same time with WaitFor(). http://gerrit.cloudera.org:8080/#/c/10540/10/src/kudu/tools/kudu-admin-test.cc@1711 PS10, Line 1711: timer.join(); : timer_cleanup.cancel(); The use of another thread just to do timing seems unnecessarily complex. How about capturing MonoTime::Now() before the first rebalancer run, setting up a deadline using that value plus 30s, then waiting for that deadline here? http://gerrit.cloudera.org:8080/#/c/10540/10/src/kudu/tools/kudu-admin-test.cc@1750 PS10, Line 1750: if (!AllowSlowTests()) { : LOG(WARNING) << "test is skipped; set KUDU_ALLOW_SLOW_TESTS=1 to run"; : return; : } : : const auto is_343_scheme = (GetParam() == Kudu1097::Enable); : constexpr int kRepFactor = 3; : constexpr int kNumTservers = 2 * (kRepFactor + 1); : constexpr int kNumTables = 10; : const string table_name_pattern = "rebalance_test_table_$0"; : constexpr int kTserverUnresponsiveMs = 3000; : const vector<string> kMasterFlags = { : Substitute("--raft_prepare_replacement_before_eviction=$0", is_343_scheme), : Substitute("--tserver_unresponsive_timeout_ms=$0", kTserverUnresponsiveMs), : }; : const vector<string> kTserverFlags = { : Substitute("--raft_prepare_replacement_before_eviction=$0", is_343_scheme), : }; : : FLAGS_num_tablet_servers = kNumTservers; : FLAGS_num_replicas = kRepFactor; : NO_FATALS(BuildAndStart(kTserverFlags, kMasterFlags)); : : ASSERT_OK(CreateUnbalancedTables( : cluster_.get(), client_.get(), schema_, table_name_pattern, kNumTables, : kRepFactor, kRepFactor + 1, kNumTservers, kTserverUnresponsiveMs)); This is common to all of these new tests; can it be shared? http://gerrit.cloudera.org:8080/#/c/10540/10/src/kudu/tools/kudu-admin-test.cc@1808 PS10, Line 1808: // Run another rebalancer concurrently in the main test thread. Hmm, is this enough concurrency to be worthwhile? How long does each rebalancer take to run? Should we spin up more than 2 at the same time? http://gerrit.cloudera.org:8080/#/c/10540/10/src/kudu/tools/kudu-admin-test.cc@1887 PS10, Line 1887: const auto sleep_sec = 5 + (r.Next() % 3); Seems like we've picked a particular range of sleeping durations that we think is precisely enough to stop a tserver mid-rebalance. That isn't particularly robust; what if it's a TSAN environment where everything is moving more slowly, and we end up shutting down the tserver well before the rebalance starts? Or after it ends? What if instead we flipped the same tserver on and off repeatedly during the rebalance? I think that'd increase the chances of an overlap. Would it prevent the test from passing? http://gerrit.cloudera.org:8080/#/c/10540/10/src/kudu/tools/kudu-admin-test.cc@1902 PS10, Line 1902: "--max_moves_per_server=2", Why is this flag important? http://gerrit.cloudera.org:8080/#/c/10540/10/src/kudu/tools/kudu-admin-test.cc@1975 PS10, Line 1975: SleepFor(MonoDelta::FromSeconds(3 + (r.Next() % 3))); See above; this also seems too delicately chosen to be robust. -- To view, visit http://gerrit.cloudera.org:8080/10540 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: kudu Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I78b3dcea71ed303f6ecd199604b2385796d05da8 Gerrit-Change-Number: 10540 Gerrit-PatchSet: 10 Gerrit-Owner: Alexey Serbin <[email protected]> Gerrit-Reviewer: Adar Dembo <[email protected]> Gerrit-Reviewer: Alexey Serbin <[email protected]> Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Mike Percy <[email protected]> Gerrit-Reviewer: Tidy Bot Gerrit-Reviewer: Will Berkeley <[email protected]> Gerrit-Comment-Date: Thu, 21 Jun 2018 23:36:36 +0000 Gerrit-HasComments: Yes
