Adar Dembo has posted comments on this change. ( http://gerrit.cloudera.org:8080/11364 )
Change subject: [tests] make master-stress-test more stable ...................................................................... Patch Set 2: (7 comments) http://gerrit.cloudera.org:8080/#/c/11364/2//COMMIT_MSG Commit Message: http://gerrit.cloudera.org:8080/#/c/11364/2//COMMIT_MSG@9 PS2, Line 9: The master-stress-test has been flaky for some time. After looking : at those failure closely, I found about five different issues. This : patch addresses the most prominent one: failures of the test scenario : because of timeouts errors in case of TSAN builds. The timeout errors : were induced by frequent RPC queue overflows. Thanks for tackling this. Are you able to quantify how much this particular patch (vs. the others) deflakes the test? http://gerrit.cloudera.org:8080/#/c/11364/2/src/kudu/integration-tests/master-stress-test.cc File src/kudu/integration-tests/master-stress-test.cc: http://gerrit.cloudera.org:8080/#/c/11364/2/src/kudu/integration-tests/master-stress-test.cc@140 PS2, Line 140: // Make the Raft heartbeat interval shorter to allow for faster detection : // of failed/restarted leader masters. : opts.extra_master_flags.emplace_back("--raft_heartbeat_interval_ms=500"); Isn't this the default value? http://gerrit.cloudera.org:8080/#/c/11364/2/src/kudu/integration-tests/master-stress-test.cc@161 PS2, Line 161: opts.extra_tserver_flags.emplace_back("--raft_heartbeat_interval_ms=500"); See above. http://gerrit.cloudera.org:8080/#/c/11364/2/src/kudu/integration-tests/master-stress-test.cc@327 PS2, Line 327: if (s.IsNetworkError()) { : continue; : } : if (tablet_ids.empty()) { : continue; : } Should we backoff in both of these cases? http://gerrit.cloudera.org:8080/#/c/11364/2/src/kudu/integration-tests/master-stress-test.cc@415 PS2, Line 415: it's crucial to keep the : // master up All masters? Or just the new one? If the latter, maybe we should modify this to kill masters in a round robin instead of randomly. I'm just a little concerned that a 2s sleep is quite long when the test only runs for 5s in normal mode. http://gerrit.cloudera.org:8080/#/c/11364/2/src/kudu/integration-tests/master-stress-test.cc@492 PS2, Line 492: FLAGS_timeout_ms = kDefaultAdminTimeout.ToMilliseconds(); We don't use the CLI in this test, do we? If this is for the LeaderMasterProxy, can you note that in a comment? http://gerrit.cloudera.org:8080/#/c/11364/2/src/kudu/mini-cluster/external_mini_cluster.h File src/kudu/mini-cluster/external_mini_cluster.h: http://gerrit.cloudera.org:8080/#/c/11364/2/src/kudu/mini-cluster/external_mini_cluster.h@159 PS2, Line 159: // an incomplete connection negotiation will timeout. Can you note the default here? -- To view, visit http://gerrit.cloudera.org:8080/11364 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: kudu Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I6b30d8afd4a24acdbd96481cadeaf8f6a9475adf Gerrit-Change-Number: 11364 Gerrit-PatchSet: 2 Gerrit-Owner: Alexey Serbin <aser...@cloudera.com> Gerrit-Reviewer: Adar Dembo <a...@cloudera.com> Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Will Berkeley <wdberke...@gmail.com> Gerrit-Comment-Date: Thu, 30 Aug 2018 20:31:00 +0000 Gerrit-HasComments: Yes