Alexey Serbin has posted comments on this change. ( http://gerrit.cloudera.org:8080/11251 )
Change subject: [WIP] KUDU-2245 Graceful leadership transfer ...................................................................... Patch Set 4: (14 comments) Just skimming through. I'm going to take another look tomorrow. http://gerrit.cloudera.org:8080/#/c/11251/2/src/kudu/consensus/consensus.proto File src/kudu/consensus/consensus.proto: http://gerrit.cloudera.org:8080/#/c/11251/2/src/kudu/consensus/consensus.proto@462 PS2, Line 462: Do not set in ABRUPT mode. nit: Does it mean it's not effective in ABRUPT mode (i.e. ignored)? Maybe, change to 'Ignored in ABRUPT mode, do not set.' http://gerrit.cloudera.org:8080/#/c/11251/2/src/kudu/consensus/consensus_peers.h File src/kudu/consensus/consensus_peers.h: http://gerrit.cloudera.org:8080/#/c/11251/2/src/kudu/consensus/consensus_peers.h@82 PS2, Line 82: does not count as a pending request Not sure I understand this. Did you mean pending Raft config change? http://gerrit.cloudera.org:8080/#/c/11251/2/src/kudu/consensus/consensus_peers.cc File src/kudu/consensus/consensus_peers.cc: http://gerrit.cloudera.org:8080/#/c/11251/2/src/kudu/consensus/consensus_peers.cc@293 PS2, Line 293: & nit: is it even needed? I would expect we don't need to capture anything for that. http://gerrit.cloudera.org:8080/#/c/11251/4/src/kudu/consensus/consensus_queue.cc File src/kudu/consensus/consensus_queue.cc: http://gerrit.cloudera.org:8080/#/c/11251/4/src/kudu/consensus/consensus_queue.cc@1040 PS4, Line 1040: TrackedPeer* peer Maybe, a const reference instead of a pointer? http://gerrit.cloudera.org:8080/#/c/11251/4/src/kudu/consensus/consensus_queue.cc@1455 PS4, Line 1455: Unable to notify RaftConsensus of peer to promote. This messages seems to be outdated. http://gerrit.cloudera.org:8080/#/c/11251/4/src/kudu/consensus/peer_manager.h File src/kudu/consensus/peer_manager.h: http://gerrit.cloudera.org:8080/#/c/11251/4/src/kudu/consensus/peer_manager.h@66 PS4, Line 66: void Why not to return Status ? That signature would be more robust, and callers, if needed, might explicitly ignore the result status. http://gerrit.cloudera.org:8080/#/c/11251/4/src/kudu/consensus/raft_consensus.cc File src/kudu/consensus/raft_consensus.cc: http://gerrit.cloudera.org:8080/#/c/11251/4/src/kudu/consensus/raft_consensus.cc@608 PS4, Line 608: BeginLeaderTransferPeriodUnlocked(new_leader_uuid); Is it worth verifying that new_leader_uuid != peer_uuid() prior to kicking off the process? http://gerrit.cloudera.org:8080/#/c/11251/4/src/kudu/consensus/raft_consensus.cc@984 PS4, Line 984: LOG_WITH_PREFIX_UNLOCKED(INFO) << "Not signalling peer " << peer_uuid : << "to start an election: it's not a voter " : << "in the active config."; Add 'return;' after logging ? http://gerrit.cloudera.org:8080/#/c/11251/4/src/kudu/consensus/raft_consensus.cc@990 PS4, Line 990: peer_manager_->StartElection(peer_uuid); What if StartElection() returns non-OK status? http://gerrit.cloudera.org:8080/#/c/11251/4/src/kudu/tools/kudu-admin-test.cc File src/kudu/tools/kudu-admin-test.cc: PS4: As discussed offline, it's worth adding some negative testcases as well: * attempt transferring leadership to a replica that has just been evicted from the config * attempt transferring leadership to non-voter replica * attempt transferring leadership to a replica which fell behind WAL segment GC threshold Also, what it the behavior when asking a leader during transfer period to abruptly step down? http://gerrit.cloudera.org:8080/#/c/11251/4/src/kudu/tools/kudu-admin-test.cc@1282 PS4, Line 1282: ASSERT_TRUE(s.ok()) << s.ToString() Here and elsewhere in this patch: if something goes wrong, it helps a lot to have the stderr printed out as well. http://gerrit.cloudera.org:8080/#/c/11251/4/src/kudu/tools/kudu-admin-test.cc@1288 PS4, Line 1288: ASSERT_OK(GetLeaderReplicaWithRetries(tablet_id_, &master_observed_leader)); : ASSERT_TRUE(ContainsKey(possible_new_leaders, master_observed_leader->uuid())); Shouldn't RunKuduTool be under this ASSERT_EVENTUALLY as well? Don't you expect it to be flaky a bit otherwise? http://gerrit.cloudera.org:8080/#/c/11251/4/src/kudu/tools/tool_replica_util.h File src/kudu/tools/tool_replica_util.h: http://gerrit.cloudera.org:8080/#/c/11251/4/src/kudu/tools/tool_replica_util.h@57 PS4, Line 57: that the replica with UUID 'leader_uuid' step down. nit: the replica with UUID 'leader_uuid' to step down. http://gerrit.cloudera.org:8080/#/c/11251/4/src/kudu/tserver/tablet_service.cc File src/kudu/tserver/tablet_service.cc: http://gerrit.cloudera.org:8080/#/c/11251/4/src/kudu/tserver/tablet_service.cc@1146 PS4, Line 1146: uuid nit: UUID -- To view, visit http://gerrit.cloudera.org:8080/11251 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: kudu Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Ic97343af9eb349556424c999799ed5e2941f0083 Gerrit-Change-Number: 11251 Gerrit-PatchSet: 4 Gerrit-Owner: Will Berkeley <[email protected]> Gerrit-Reviewer: Alexey Serbin <[email protected]> Gerrit-Reviewer: Fengling Wang <[email protected]> Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Mike Percy <[email protected]> Gerrit-Reviewer: Tidy Bot Gerrit-Comment-Date: Wed, 29 Aug 2018 05:09:18 +0000 Gerrit-HasComments: Yes
