[kudu-CR] KUDU-2069 p4: stop replication from failed servers in maintenance mode

Alexey Serbin (Code Review) Fri, 27 Sep 2019 10:44:36 -0700

Alexey Serbin has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/14222 )


Change subject: KUDU-2069 p4: stop replication from failed servers in 
maintenance mode
......................................................................


Patch Set 5:

(9 comments)

Overall looks good to me; a few nits.

http://gerrit.cloudera.org:8080/#/c/14222/2/src/kudu/consensus/quorum_util-test.cc
File src/kudu/consensus/quorum_util-test.cc:

http://gerrit.cloudera.org:8080/#/c/14222/2/src/kudu/consensus/quorum_util-test.cc@636
PS2, Line 636:     EXPECT_FALSE(ShouldAddReplica(config, 5, { "A", "B" }));
> This is somewhat white-box knowledge, but the "whitelist" isn't even part o
I think you meant the set of servers to ignore is not a part of 
ShouldEvictReplica(): yes, indeed.  For some reason, it seemed to me that 
ShouldEvictReplica() is also affected by the maintenance mode.


http://gerrit.cloudera.org:8080/#/c/14222/2/src/kudu/consensus/quorum_util-test.cc@689
PS2, Line 689:     EXPECT_FALSE(ShouldAddReplica(config, 5, { "A", "B" }));
> Sure, I'll add those cases.
Great, thanks a lot.


http://gerrit.cloudera.org:8080/#/c/14222/5/src/kudu/consensus/quorum_util-test.cc
File src/kudu/consensus/quorum_util-test.cc:

http://gerrit.cloudera.org:8080/#/c/14222/5/src/kudu/consensus/quorum_util-test.cc@645
PS5, Line 645:   {
             :     RaftConfigPB config;
             :     AddPeer(&config, "A", V, '+', {{"REPLACE", true}});
             :     AddPeer(&config, "B", V, '+');
             :     AddPeer(&config, "C", V, '+');
             :     EXPECT_TRUE(ShouldAddReplica(config, 3, { "A" }));
             :   }
             :   {
             :     RaftConfigPB config;
             :     AddPeer(&config, "A", V, '-', {{"REPLACE", true}});
             :     AddPeer(&config, "B", V, '+');
             :     AddPeer(&config, "C", V, '+');
             :     // Even if the replica to replace is meant to be ignored on 
failure, we
             :     // should honor the replacement and try to add a replica.
             :     EXPECT_TRUE(ShouldAddReplica(config, 3, { "A" }));
             :   }
nit: maybe, put this under for () cycle to iterate over the health status of 
replica "A"?  You could add the verification for the "unknown" health status of 
replica as well easily.


http://gerrit.cloudera.org:8080/#/c/14222/5/src/kudu/consensus/quorum_util-test.cc@675
PS5, Line 675: -
Does it make sense to add coverage for other health statuses as well (unknown, 
healthy) to explicitly state how the maintenance mode is interpreted for those?


http://gerrit.cloudera.org:8080/#/c/14222/5/src/kudu/consensus/quorum_util-test.cc@684
PS5, Line 684:     AddPeer(&config, "A", N, '-', {{"PROMOTE", true}});
             :     AddPeer(&config, "B", N, '-', {{"PROMOTE", true}});
ditto


http://gerrit.cloudera.org:8080/#/c/14222/5/src/kudu/consensus/quorum_util.cc
File src/kudu/consensus/quorum_util.cc:

http://gerrit.cloudera.org:8080/#/c/14222/5/src/kudu/consensus/quorum_util.cc@455
PS5, Line 455:     if (VLOG_IS_ON(2) && ignore_failure_for_underreplication) {
             :       VLOG(2) << Substitute("ignoring $0 if failed", peer_uuid);
             :     }
Does it deserve VLOG(1) it's not important at all?


http://gerrit.cloudera.org:8080/#/c/14222/5/src/kudu/integration-tests/maintenance_mode-itest.cc
File src/kudu/integration-tests/maintenance_mode-itest.cc:

http://gerrit.cloudera.org:8080/#/c/14222/5/src/kudu/integration-tests/maintenance_mode-itest.cc@269
PS5, Line 269: TEST_F(MaintenanceModeRF3ITest, 
TestMaintenanceModeDoesntObstructMove) {
Thank you for adding this scenario.


http://gerrit.cloudera.org:8080/#/c/14222/5/src/kudu/integration-tests/maintenance_mode-itest.cc@379
PS5, Line 379:   const TServerDetails* maintenance_details = FindOrDie(ts_map, 
maintenance_uuid);
             :   vector<string> mnt_tablet_ids;
             :   ASSERT_OK(ListRunningTabletIds(maintenance_details, 
MonoDelta::FromSeconds(30), &mnt_tablet_ids));
             :   ASSERT_EQ(kNumTablets, mnt_tablet_ids.size());
Could it happen that replicas at the server fell behind WAL GC threshold since 
the workload is running in background?  If so, there might be an interval when 
such replicas first evicted, and then new replicas appear, replacing the 
evicted ones.  That might bring some flakiness at line 382, so maybe it's worth 
wrapping lines 380-382 into ASSERT_EVENTUALLY in that case.


http://gerrit.cloudera.org:8080/#/c/14222/5/src/kudu/integration-tests/maintenance_mode-itest.cc@385
PS5, Line 385:   NO_FATALS(create_table.StopAndJoin());
Does it make sense to verify the number of written rows after stopping the 
workload or that's too much?  Something like ClusterVerifier::CheckRowCount()



--
To view, visit http://gerrit.cloudera.org:8080/14222
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I9a63b55011d16900c0d27eac0eb75880074204db
Gerrit-Change-Number: 14222
Gerrit-PatchSet: 5
Gerrit-Owner: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Adar Dembo <a...@cloudera.com>
Gerrit-Reviewer: Alexey Serbin <aser...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Hao Hao <hao....@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Tidy Bot (241)
Gerrit-Comment-Date: Fri, 27 Sep 2019 17:44:22 +0000
Gerrit-HasComments: Yes

[kudu-CR] KUDU-2069 p4: stop replication from failed servers in maintenance mode

Reply via email to