Alexey Serbin has posted comments on this change. ( http://gerrit.cloudera.org:8080/14222 )
Change subject: KUDU-2069 p4: stop replication from failed servers in maintenance mode ...................................................................... Patch Set 2: (7 comments) http://gerrit.cloudera.org:8080/#/c/14222/2/src/kudu/consensus/quorum_util-test.cc File src/kudu/consensus/quorum_util-test.cc: http://gerrit.cloudera.org:8080/#/c/14222/2/src/kudu/consensus/quorum_util-test.cc@636 PS2, Line 636: TEST_P(QuorumUtilHealthPolicyParamTest, ShouldAddReplicaWhitelist) { What about tests scenarios for ShouldRemoveReplica() when some of the tablet servers hosting replicas are put into the maintenance mode? http://gerrit.cloudera.org:8080/#/c/14222/2/src/kudu/consensus/quorum_util-test.cc@648 PS2, Line 648: unknown server server with unknown health http://gerrit.cloudera.org:8080/#/c/14222/2/src/kudu/consensus/quorum_util-test.cc@689 PS2, Line 689: } Does it make sense to add the following cases for replication factor 3: * A:? B:+ C:+, all voters, where A is in maintenance mode * A:- B:+ C:+, all voters, where A, B, and C in maintenance mode * A:- B:- C:+, all voters, where A and B in maintenance mode Also, how does the maintenance mode affect configurations with non-voter replicas? Imagine, some non-voter replica was in process of copying when a tablet server with the failed replica was marked/put into the maintenance. What will happen to the non-voter replica? Will it be removed since the tablet server with the failed voter replica has just been put into the maintenance mode? http://gerrit.cloudera.org:8080/#/c/14222/2/src/kudu/integration-tests/maintenance_mode-itest.cc File src/kudu/integration-tests/maintenance_mode-itest.cc: http://gerrit.cloudera.org:8080/#/c/14222/2/src/kudu/integration-tests/maintenance_mode-itest.cc@19 PS2, Line 19: nit: remove the extra line http://gerrit.cloudera.org:8080/#/c/14222/2/src/kudu/integration-tests/maintenance_mode-itest.cc@156 PS2, Line 156: TestMaintenanceModeDoesntRereplicate > nit: TestMaintenanceModeWithoutRereplication? Does it make sense to add a scenario to verify that it's possible to move replicas _from_ a tablet server put into the maintenance mode? http://gerrit.cloudera.org:8080/#/c/14222/2/src/kudu/integration-tests/maintenance_mode-itest.cc@195 PS2, Line 195: // Bringing the tablet server down shouldn't lead to re-replication. Another reasons for re-replication are: * follower replica falling behind leader's WAL GC threshold * replica failed to bootstrap due to IO error Does it make sense to add test scenario for those or we assume that's already covered due to the way how the code is written in quorum_util.cc? If the latter, then please add a comment to reflect those possible cases as well. http://gerrit.cloudera.org:8080/#/c/14222/2/src/kudu/master/catalog_manager.cc File src/kudu/master/catalog_manager.cc: http://gerrit.cloudera.org:8080/#/c/14222/2/src/kudu/master/catalog_manager.cc@4190 PS2, Line 4190: unordered_set<string> whitelisted_uuids; : master_->ts_manager()->GetWhitelistedUuids(&whitelisted_uuids); > It's wasteful to generate this for each tablet in the report. +1 -- To view, visit http://gerrit.cloudera.org:8080/14222 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: kudu Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I9a63b55011d16900c0d27eac0eb75880074204db Gerrit-Change-Number: 14222 Gerrit-PatchSet: 2 Gerrit-Owner: Andrew Wong <aw...@cloudera.com> Gerrit-Reviewer: Adar Dembo <a...@cloudera.com> Gerrit-Reviewer: Alexey Serbin <aser...@cloudera.com> Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com> Gerrit-Reviewer: Hao Hao <hao....@cloudera.com> Gerrit-Reviewer: Kudu Jenkins (120) Gerrit-Comment-Date: Fri, 13 Sep 2019 07:47:45 +0000 Gerrit-HasComments: Yes