Alexey Serbin has posted comments on this change. ( http://gerrit.cloudera.org:8080/10540 )
Change subject: [tools] extra integration tests for the rebalancer ...................................................................... Patch Set 10: (8 comments) http://gerrit.cloudera.org:8080/#/c/10540/10/src/kudu/tools/kudu-admin-test.cc File src/kudu/tools/kudu-admin-test.cc: http://gerrit.cloudera.org:8080/#/c/10540/10/src/kudu/tools/kudu-admin-test.cc@1539 PS10, Line 1539: ASSERT_STR_CONTAINS(out, "rebalancing is complete: cluster is balanced") : << "stderr: " << err; > No need to check this again. Done http://gerrit.cloudera.org:8080/#/c/10540/10/src/kudu/tools/kudu-admin-test.cc@1690 PS10, Line 1690: do_create_tables = false; : do_alter_tables = false; : do_delete_tables = false; > Not sure I see the point of separating these. Wouldn't a single CountDownLa Using separate atomics per corresponding thread seemed simpler to me without bringing in unneeded sync-wait primitives. But if using CountDownLatch looks more elegant -- sure, let's do that. http://gerrit.cloudera.org:8080/#/c/10540/10/src/kudu/tools/kudu-admin-test.cc@1711 PS10, Line 1711: timer.join(); : timer_cleanup.cancel(); > The use of another thread just to do timing seems unnecessarily complex. Ho The main purpose of this is to allow the rebalancer to run concurrently with those DDL operations threads. Having the rebalancer running in the main thread is beneficial in case of errors, if any -- at least the test would not crash and it's easier to debug it in that case. I added corresponding comment, trying to clarify on that. http://gerrit.cloudera.org:8080/#/c/10540/10/src/kudu/tools/kudu-admin-test.cc@1750 PS10, Line 1750: if (!AllowSlowTests()) { : LOG(WARNING) << "test is skipped; set KUDU_ALLOW_SLOW_TESTS=1 to run"; : return; : } : : const auto is_343_scheme = (GetParam() == Kudu1097::Enable); : constexpr int kRepFactor = 3; : constexpr int kNumTservers = 2 * (kRepFactor + 1); : constexpr int kNumTables = 10; : const string table_name_pattern = "rebalance_test_table_$0"; : constexpr int kTserverUnresponsiveMs = 3000; : const vector<string> kMasterFlags = { : Substitute("--raft_prepare_replacement_before_eviction=$0", is_343_scheme), : Substitute("--tserver_unresponsive_timeout_ms=$0", kTserverUnresponsiveMs), : }; : const vector<string> kTserverFlags = { : Substitute("--raft_prepare_replacement_before_eviction=$0", is_343_scheme), : }; : : FLAGS_num_tablet_servers = kNumTservers; : FLAGS_num_replicas = kRepFactor; : NO_FATALS(BuildAndStart(kTserverFlags, kMasterFlags)); : : ASSERT_OK(CreateUnbalancedTables( : cluster_.get(), client_.get(), schema_, table_name_pattern, kNumTables, : kRepFactor, kRepFactor + 1, kNumTservers, kTserverUnresponsiveMs)); > This is common to all of these new tests; can it be shared? Done http://gerrit.cloudera.org:8080/#/c/10540/10/src/kudu/tools/kudu-admin-test.cc@1808 PS10, Line 1808: // Run another rebalancer concurrently in the main test thread. > Hmm, is this enough concurrency to be worthwhile? How long does each rebala The idea is to make sure the cluster state would be consistent and rebalancers would not crush if running concurrently. In essence, if each rebalancer follows the same greedy algorithm with minimal variations (which is true), not many rebalancers are necessary to spot a problem here, if any. However, I agree it makes sense to have more and start them 'more synchronously'. http://gerrit.cloudera.org:8080/#/c/10540/10/src/kudu/tools/kudu-admin-test.cc@1887 PS10, Line 1887: const auto sleep_sec = 5 + (r.Next() % 3); > Seems like we've picked a particular range of sleeping durations that we th Yep, that was about having some 'precise' timing. The idea was that the desired timing would be so for almost all runs, and that seems to be true even with dist-test and TSAN builds. Flipping a tserver on/off is not exactly what is needed here, and I'm not sure it's possible to track the 'inacceptable health status UNAVAILABLE' error in that case reliably at all. I updated the scenario to rely on the state of the cluster that 'kudu ksck' sees when the rebalancer runs, making this scenario much more robust. http://gerrit.cloudera.org:8080/#/c/10540/10/src/kudu/tools/kudu-admin-test.cc@1902 PS10, Line 1902: "--max_moves_per_server=2", > Why is this flag important? Done http://gerrit.cloudera.org:8080/#/c/10540/10/src/kudu/tools/kudu-admin-test.cc@1975 PS10, Line 1975: SleepFor(MonoDelta::FromSeconds(3 + (r.Next() % 3))); > See above; this also seems too delicately chosen to be robust. Done -- To view, visit http://gerrit.cloudera.org:8080/10540 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: kudu Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I78b3dcea71ed303f6ecd199604b2385796d05da8 Gerrit-Change-Number: 10540 Gerrit-PatchSet: 10 Gerrit-Owner: Alexey Serbin <[email protected]> Gerrit-Reviewer: Adar Dembo <[email protected]> Gerrit-Reviewer: Alexey Serbin <[email protected]> Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Mike Percy <[email protected]> Gerrit-Reviewer: Tidy Bot Gerrit-Reviewer: Will Berkeley <[email protected]> Gerrit-Comment-Date: Mon, 02 Jul 2018 21:26:08 +0000 Gerrit-HasComments: Yes
