Alexey Serbin has posted comments on this change. ( http://gerrit.cloudera.org:8080/14222 )
Change subject: KUDU-2069 p4: stop replication from failed servers in maintenance mode ...................................................................... Patch Set 5: (9 comments) Overall looks good to me; a few nits. http://gerrit.cloudera.org:8080/#/c/14222/2/src/kudu/consensus/quorum_util-test.cc File src/kudu/consensus/quorum_util-test.cc: http://gerrit.cloudera.org:8080/#/c/14222/2/src/kudu/consensus/quorum_util-test.cc@636 PS2, Line 636: EXPECT_FALSE(ShouldAddReplica(config, 5, { "A", "B" })); > This is somewhat white-box knowledge, but the "whitelist" isn't even part o I think you meant the set of servers to ignore is not a part of ShouldEvictReplica(): yes, indeed. For some reason, it seemed to me that ShouldEvictReplica() is also affected by the maintenance mode. http://gerrit.cloudera.org:8080/#/c/14222/2/src/kudu/consensus/quorum_util-test.cc@689 PS2, Line 689: EXPECT_FALSE(ShouldAddReplica(config, 5, { "A", "B" })); > Sure, I'll add those cases. Great, thanks a lot. http://gerrit.cloudera.org:8080/#/c/14222/5/src/kudu/consensus/quorum_util-test.cc File src/kudu/consensus/quorum_util-test.cc: http://gerrit.cloudera.org:8080/#/c/14222/5/src/kudu/consensus/quorum_util-test.cc@645 PS5, Line 645: { : RaftConfigPB config; : AddPeer(&config, "A", V, '+', {{"REPLACE", true}}); : AddPeer(&config, "B", V, '+'); : AddPeer(&config, "C", V, '+'); : EXPECT_TRUE(ShouldAddReplica(config, 3, { "A" })); : } : { : RaftConfigPB config; : AddPeer(&config, "A", V, '-', {{"REPLACE", true}}); : AddPeer(&config, "B", V, '+'); : AddPeer(&config, "C", V, '+'); : // Even if the replica to replace is meant to be ignored on failure, we : // should honor the replacement and try to add a replica. : EXPECT_TRUE(ShouldAddReplica(config, 3, { "A" })); : } nit: maybe, put this under for () cycle to iterate over the health status of replica "A"? You could add the verification for the "unknown" health status of replica as well easily. http://gerrit.cloudera.org:8080/#/c/14222/5/src/kudu/consensus/quorum_util-test.cc@675 PS5, Line 675: - Does it make sense to add coverage for other health statuses as well (unknown, healthy) to explicitly state how the maintenance mode is interpreted for those? http://gerrit.cloudera.org:8080/#/c/14222/5/src/kudu/consensus/quorum_util-test.cc@684 PS5, Line 684: AddPeer(&config, "A", N, '-', {{"PROMOTE", true}}); : AddPeer(&config, "B", N, '-', {{"PROMOTE", true}}); ditto http://gerrit.cloudera.org:8080/#/c/14222/5/src/kudu/consensus/quorum_util.cc File src/kudu/consensus/quorum_util.cc: http://gerrit.cloudera.org:8080/#/c/14222/5/src/kudu/consensus/quorum_util.cc@455 PS5, Line 455: if (VLOG_IS_ON(2) && ignore_failure_for_underreplication) { : VLOG(2) << Substitute("ignoring $0 if failed", peer_uuid); : } Does it deserve VLOG(1) it's not important at all? http://gerrit.cloudera.org:8080/#/c/14222/5/src/kudu/integration-tests/maintenance_mode-itest.cc File src/kudu/integration-tests/maintenance_mode-itest.cc: http://gerrit.cloudera.org:8080/#/c/14222/5/src/kudu/integration-tests/maintenance_mode-itest.cc@269 PS5, Line 269: TEST_F(MaintenanceModeRF3ITest, TestMaintenanceModeDoesntObstructMove) { Thank you for adding this scenario. http://gerrit.cloudera.org:8080/#/c/14222/5/src/kudu/integration-tests/maintenance_mode-itest.cc@379 PS5, Line 379: const TServerDetails* maintenance_details = FindOrDie(ts_map, maintenance_uuid); : vector<string> mnt_tablet_ids; : ASSERT_OK(ListRunningTabletIds(maintenance_details, MonoDelta::FromSeconds(30), &mnt_tablet_ids)); : ASSERT_EQ(kNumTablets, mnt_tablet_ids.size()); Could it happen that replicas at the server fell behind WAL GC threshold since the workload is running in background? If so, there might be an interval when such replicas first evicted, and then new replicas appear, replacing the evicted ones. That might bring some flakiness at line 382, so maybe it's worth wrapping lines 380-382 into ASSERT_EVENTUALLY in that case. http://gerrit.cloudera.org:8080/#/c/14222/5/src/kudu/integration-tests/maintenance_mode-itest.cc@385 PS5, Line 385: NO_FATALS(create_table.StopAndJoin()); Does it make sense to verify the number of written rows after stopping the workload or that's too much? Something like ClusterVerifier::CheckRowCount() -- To view, visit http://gerrit.cloudera.org:8080/14222 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: kudu Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I9a63b55011d16900c0d27eac0eb75880074204db Gerrit-Change-Number: 14222 Gerrit-PatchSet: 5 Gerrit-Owner: Andrew Wong <aw...@cloudera.com> Gerrit-Reviewer: Adar Dembo <a...@cloudera.com> Gerrit-Reviewer: Alexey Serbin <aser...@cloudera.com> Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com> Gerrit-Reviewer: Hao Hao <hao....@cloudera.com> Gerrit-Reviewer: Kudu Jenkins (120) Gerrit-Reviewer: Tidy Bot (241) Gerrit-Comment-Date: Fri, 27 Sep 2019 17:44:22 +0000 Gerrit-HasComments: Yes