Adar Dembo has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/10540 )

Change subject: [tools] Extra integration tests for the rebalancer
......................................................................


Patch Set 10:

(8 comments)

These new tests are pretty complex; can you make sure they're robust by looping 
them in dist-test, both with and without TSAN, and with stress threads?

http://gerrit.cloudera.org:8080/#/c/10540/10/src/kudu/tools/kudu-admin-test.cc
File src/kudu/tools/kudu-admin-test.cc:

http://gerrit.cloudera.org:8080/#/c/10540/10/src/kudu/tools/kudu-admin-test.cc@1539
PS10, Line 1539:     ASSERT_STR_CONTAINS(out, "rebalancing is complete: cluster 
is balanced")
               :         << "stderr: " << err;
No need to check this again.


http://gerrit.cloudera.org:8080/#/c/10540/10/src/kudu/tools/kudu-admin-test.cc@1690
PS10, Line 1690:     do_create_tables = false;
               :     do_alter_tables = false;
               :     do_delete_tables = false;
Not sure I see the point of separating these. Wouldn't a single CountDownLatch 
be better? The nice thing about a latch is that the threads could sleep and 
check the latch at the same time with WaitFor().


http://gerrit.cloudera.org:8080/#/c/10540/10/src/kudu/tools/kudu-admin-test.cc@1711
PS10, Line 1711:       timer.join();
               :       timer_cleanup.cancel();
The use of another thread just to do timing seems unnecessarily complex. How 
about capturing MonoTime::Now() before the first rebalancer run, setting up a 
deadline using that value plus 30s, then waiting for that deadline here?


http://gerrit.cloudera.org:8080/#/c/10540/10/src/kudu/tools/kudu-admin-test.cc@1750
PS10, Line 1750:   if (!AllowSlowTests()) {
               :     LOG(WARNING) << "test is skipped; set 
KUDU_ALLOW_SLOW_TESTS=1 to run";
               :     return;
               :   }
               :
               :   const auto is_343_scheme = (GetParam() == Kudu1097::Enable);
               :   constexpr int kRepFactor = 3;
               :   constexpr int kNumTservers = 2 * (kRepFactor + 1);
               :   constexpr int kNumTables = 10;
               :   const string table_name_pattern = "rebalance_test_table_$0";
               :   constexpr int kTserverUnresponsiveMs = 3000;
               :   const vector<string> kMasterFlags = {
               :     
Substitute("--raft_prepare_replacement_before_eviction=$0", is_343_scheme),
               :     Substitute("--tserver_unresponsive_timeout_ms=$0", 
kTserverUnresponsiveMs),
               :   };
               :   const vector<string> kTserverFlags = {
               :     
Substitute("--raft_prepare_replacement_before_eviction=$0", is_343_scheme),
               :   };
               :
               :   FLAGS_num_tablet_servers = kNumTservers;
               :   FLAGS_num_replicas = kRepFactor;
               :   NO_FATALS(BuildAndStart(kTserverFlags, kMasterFlags));
               :
               :   ASSERT_OK(CreateUnbalancedTables(
               :       cluster_.get(), client_.get(), schema_, 
table_name_pattern, kNumTables,
               :       kRepFactor, kRepFactor + 1, kNumTservers, 
kTserverUnresponsiveMs));
This is common to all of these new tests; can it be shared?


http://gerrit.cloudera.org:8080/#/c/10540/10/src/kudu/tools/kudu-admin-test.cc@1808
PS10, Line 1808:   // Run another rebalancer concurrently in the main test 
thread.
Hmm, is this enough concurrency to be worthwhile? How long does each rebalancer 
take to run? Should we spin up more than 2 at the same time?


http://gerrit.cloudera.org:8080/#/c/10540/10/src/kudu/tools/kudu-admin-test.cc@1887
PS10, Line 1887:     const auto sleep_sec = 5 + (r.Next() % 3);
Seems like we've picked a particular range of sleeping durations that we think 
is precisely enough to stop a tserver mid-rebalance. That isn't particularly 
robust; what if it's a TSAN environment where everything is moving more slowly, 
and we end up shutting down the tserver well before the rebalance starts? Or 
after it ends?

What if instead we flipped the same tserver on and off repeatedly during the 
rebalance? I think that'd increase the chances of an overlap. Would it prevent 
the test from passing?


http://gerrit.cloudera.org:8080/#/c/10540/10/src/kudu/tools/kudu-admin-test.cc@1902
PS10, Line 1902:       "--max_moves_per_server=2",
Why is this flag important?


http://gerrit.cloudera.org:8080/#/c/10540/10/src/kudu/tools/kudu-admin-test.cc@1975
PS10, Line 1975:   SleepFor(MonoDelta::FromSeconds(3 + (r.Next() % 3)));
See above; this also seems too delicately chosen to be robust.



--
To view, visit http://gerrit.cloudera.org:8080/10540
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I78b3dcea71ed303f6ecd199604b2385796d05da8
Gerrit-Change-Number: 10540
Gerrit-PatchSet: 10
Gerrit-Owner: Alexey Serbin <[email protected]>
Gerrit-Reviewer: Adar Dembo <[email protected]>
Gerrit-Reviewer: Alexey Serbin <[email protected]>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <[email protected]>
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Will Berkeley <[email protected]>
Gerrit-Comment-Date: Thu, 21 Jun 2018 23:36:36 +0000
Gerrit-HasComments: Yes

Reply via email to