Alexey Serbin has posted comments on this change. ( http://gerrit.cloudera.org:8080/10540 )

Change subject: [tools] extra integration tests for the rebalancer
......................................................................


Patch Set 10:

(8 comments)

http://gerrit.cloudera.org:8080/#/c/10540/10/src/kudu/tools/kudu-admin-test.cc
File src/kudu/tools/kudu-admin-test.cc:

http://gerrit.cloudera.org:8080/#/c/10540/10/src/kudu/tools/kudu-admin-test.cc@1539
PS10, Line 1539:     ASSERT_STR_CONTAINS(out, "rebalancing is complete: cluster is balanced")
               :         << "stderr: " << err;
> No need to check this again.
Done


http://gerrit.cloudera.org:8080/#/c/10540/10/src/kudu/tools/kudu-admin-test.cc@1690
PS10, Line 1690:     do_create_tables = false;
               :     do_alter_tables = false;
               :     do_delete_tables = false;
> Not sure I see the point of separating these. Wouldn't a single CountDownLa
Using a separate atomic per worker thread seemed simpler to me than bringing in additional sync-wait primitives.  But if using a CountDownLatch looks more elegant -- sure, let's do that.


http://gerrit.cloudera.org:8080/#/c/10540/10/src/kudu/tools/kudu-admin-test.cc@1711
PS10, Line 1711:       timer.join();
               :       timer_cleanup.cancel();
> The use of another thread just to do timing seems unnecessarily complex. Ho
The main purpose of this is to allow the rebalancer to run concurrently with the DDL operation threads.  Having the rebalancer run in the main thread is beneficial in case of errors, if any: at least the test would not crash, and it's easier to debug in that case.

I added a corresponding comment to clarify that.


http://gerrit.cloudera.org:8080/#/c/10540/10/src/kudu/tools/kudu-admin-test.cc@1750
PS10, Line 1750:   if (!AllowSlowTests()) {
               :     LOG(WARNING) << "test is skipped; set KUDU_ALLOW_SLOW_TESTS=1 to run";
               :     return;
               :   }
               :
               :   const auto is_343_scheme = (GetParam() == Kudu1097::Enable);
               :   constexpr int kRepFactor = 3;
               :   constexpr int kNumTservers = 2 * (kRepFactor + 1);
               :   constexpr int kNumTables = 10;
               :   const string table_name_pattern = "rebalance_test_table_$0";
               :   constexpr int kTserverUnresponsiveMs = 3000;
               :   const vector<string> kMasterFlags = {
               :     Substitute("--raft_prepare_replacement_before_eviction=$0", is_343_scheme),
               :     Substitute("--tserver_unresponsive_timeout_ms=$0", kTserverUnresponsiveMs),
               :   };
               :   const vector<string> kTserverFlags = {
               :     Substitute("--raft_prepare_replacement_before_eviction=$0", is_343_scheme),
               :   };
               :
               :   FLAGS_num_tablet_servers = kNumTservers;
               :   FLAGS_num_replicas = kRepFactor;
               :   NO_FATALS(BuildAndStart(kTserverFlags, kMasterFlags));
               :
               :   ASSERT_OK(CreateUnbalancedTables(
               :       cluster_.get(), client_.get(), schema_, table_name_pattern, kNumTables,
               :       kRepFactor, kRepFactor + 1, kNumTservers, kTserverUnresponsiveMs));
> This is common to all of these new tests; can it be shared?
Done


http://gerrit.cloudera.org:8080/#/c/10540/10/src/kudu/tools/kudu-admin-test.cc@1808
PS10, Line 1808:   // Run another rebalancer concurrently in the main test thread.
> Hmm, is this enough concurrency to be worthwhile? How long does each rebala
The idea is to make sure the cluster state stays consistent and the rebalancers do not crash when running concurrently.  In essence, since each rebalancer follows the same greedy algorithm with minimal variations (which is true), not many rebalancers are necessary to spot a problem here, if any.  However, I agree it makes sense to run more of them and start them 'more synchronously'.


http://gerrit.cloudera.org:8080/#/c/10540/10/src/kudu/tools/kudu-admin-test.cc@1887
PS10, Line 1887:     const auto sleep_sec = 5 + (r.Next() % 3);
> Seems like we've picked a particular range of sleeping durations that we th
Yep, that was about having some 'precise' timing.  The idea was that the desired timing would hold for almost all runs, and that seems to be true even with dist-test and TSAN builds.

Flipping a tserver on/off is not exactly what is needed here, and I'm not sure it's possible to reliably track the 'unacceptable health status UNAVAILABLE' error in that case at all.

I updated the scenario to rely on the state of the cluster that 'kudu ksck' sees when the rebalancer runs, which makes this scenario much more robust.


http://gerrit.cloudera.org:8080/#/c/10540/10/src/kudu/tools/kudu-admin-test.cc@1902
PS10, Line 1902:       "--max_moves_per_server=2",
> Why is this flag important?
Done


http://gerrit.cloudera.org:8080/#/c/10540/10/src/kudu/tools/kudu-admin-test.cc@1975
PS10, Line 1975:   SleepFor(MonoDelta::FromSeconds(3 + (r.Next() % 3)));
> See above; this also seems too delicately chosen to be robust.
Done



--
To view, visit http://gerrit.cloudera.org:8080/10540
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I78b3dcea71ed303f6ecd199604b2385796d05da8
Gerrit-Change-Number: 10540
Gerrit-PatchSet: 10
Gerrit-Owner: Alexey Serbin <[email protected]>
Gerrit-Reviewer: Adar Dembo <[email protected]>
Gerrit-Reviewer: Alexey Serbin <[email protected]>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <[email protected]>
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Will Berkeley <[email protected]>
Gerrit-Comment-Date: Mon, 02 Jul 2018 21:26:08 +0000
Gerrit-HasComments: Yes
