Adar Dembo has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/11364 )

Change subject: [tests] make master-stress-test more stable
......................................................................


Patch Set 2:

(7 comments)

http://gerrit.cloudera.org:8080/#/c/11364/2//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/11364/2//COMMIT_MSG@9
PS2, Line 9: The master-stress-test has been flaky for some time.  After looking
           : at those failure closely, I found about five different issues.  
This
           : patch addresses the most prominent one: failures of the test 
scenario
           : because of timeouts errors in case of TSAN builds.  The timeout 
errors
           : were induced by frequent RPC queue overflows.
Thanks for tackling this. Are you able to quantify how much this particular 
patch (vs. the others) deflakes the test?


http://gerrit.cloudera.org:8080/#/c/11364/2/src/kudu/integration-tests/master-stress-test.cc
File src/kudu/integration-tests/master-stress-test.cc:

http://gerrit.cloudera.org:8080/#/c/11364/2/src/kudu/integration-tests/master-stress-test.cc@140
PS2, Line 140:     // Make the Raft heartbeat interval shorter to allow for 
faster detection
             :     // of failed/restarted leader masters.
             :     
opts.extra_master_flags.emplace_back("--raft_heartbeat_interval_ms=500");
Isn't this the default value?


http://gerrit.cloudera.org:8080/#/c/11364/2/src/kudu/integration-tests/master-stress-test.cc@161
PS2, Line 161:     
opts.extra_tserver_flags.emplace_back("--raft_heartbeat_interval_ms=500");
See above.


http://gerrit.cloudera.org:8080/#/c/11364/2/src/kudu/integration-tests/master-stress-test.cc@327
PS2, Line 327:       if (s.IsNetworkError()) {
             :         continue;
             :       }
             :       if (tablet_ids.empty()) {
             :         continue;
             :       }
Should we backoff in both of these cases?


http://gerrit.cloudera.org:8080/#/c/11364/2/src/kudu/integration-tests/master-stress-test.cc@415
PS2, Line 415: it's crucial to keep the
             :       // master up
All masters? Or just the new one?

If the latter, maybe we should modify this to kill masters in a round robin 
instead of randomly. I'm just a little concerned that a 2s sleep is quite long 
when the test only runs for 5s in normal mode.


http://gerrit.cloudera.org:8080/#/c/11364/2/src/kudu/integration-tests/master-stress-test.cc@492
PS2, Line 492:   FLAGS_timeout_ms = kDefaultAdminTimeout.ToMilliseconds();
We don't use the CLI in this test, do we? If this is for the LeaderMasterProxy, 
can you note that in a comment?


http://gerrit.cloudera.org:8080/#/c/11364/2/src/kudu/mini-cluster/external_mini_cluster.h
File src/kudu/mini-cluster/external_mini_cluster.h:

http://gerrit.cloudera.org:8080/#/c/11364/2/src/kudu/mini-cluster/external_mini_cluster.h@159
PS2, Line 159:   // an incomplete connection negotiation will timeout.
Can you note the default here?



--
To view, visit http://gerrit.cloudera.org:8080/11364
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I6b30d8afd4a24acdbd96481cadeaf8f6a9475adf
Gerrit-Change-Number: 11364
Gerrit-PatchSet: 2
Gerrit-Owner: Alexey Serbin <aser...@cloudera.com>
Gerrit-Reviewer: Adar Dembo <a...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Will Berkeley <wdberke...@gmail.com>
Gerrit-Comment-Date: Thu, 30 Aug 2018 20:31:00 +0000
Gerrit-HasComments: Yes

Reply via email to