Dinesh Bhat has posted comments on this change. Change subject: KUDU-1330: Add a tool to unsafely recover from loss of majority replicas ......................................................................
Patch Set 1: (1 comment) http://gerrit.cloudera.org:8080/#/c/6066/1/src/kudu/consensus/raft_consensus.cc File src/kudu/consensus/raft_consensus.cc: PS1, Line 1556: RaftConsensus::ReplaceConfig > I'll let mike address the pending/active config concerns of this method, si TFTR David, just a heads up here that latest patch doesn't reflect all the suggestions made above, and some rationales are listed below. 1) I agree with adding checks about consensus being stuck for a while instead of letting the tool operate immediately without allowing automatic recovery. Is there a metrics we should rely on ? or should that check be a function of some peers in the config and they being unresponsive for a while ? The latter case falls under node eviction after certain timeouts, so we could wait for at least follower_unavailable_considered_failed_sec. 2) I am having second thoughts on making the API accept list of removed uuids, because we want to be able to enforce a config of {AB} on node A despite a committed config of {ACD} when C and D become dead nodes at some point. This also means that, tool also needs to input the Host and Port address endpoint of B and there are other commands to fetch that info. Currently uuid presence check is overly restrictive becz it checks if the passed uuid is in config or not and it accepts only one uuid at the moment. I need to add a unit test for the checks I have introduced in this API - i.e simulating user errors and see that they fail as expected. Having the API accept a list of uuids to be added for the new config seems like a better idea. This requires either a new RPC taking RaftConfigPB in the request directly, or ChangeConfigRequestPB can take a list of RaftPeerPB and the API can construct the new config from that list. Usual AddServer/RemoveServer can expect the size of that list to be 1. Let me know your thoughts. 3) I refactored a bit grabbing consensus states and consensus queue states as one snapshot before building the new config and consensus request for Update(). The API assumes currently that dead nodes aren't interfering, so this snapshot of the state becomes relevant when we weaken those assumptions for future revisions. -- To view, visit http://gerrit.cloudera.org:8080/6066 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I908d8c981df74d56dbd034e72001d379fb314700 Gerrit-PatchSet: 1 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Dinesh Bhat <[email protected]> Gerrit-Reviewer: David Ribeiro Alves <[email protected]> Gerrit-Reviewer: Dinesh Bhat <[email protected]> Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Mike Percy <[email protected]> Gerrit-Reviewer: Tidy Bot Gerrit-HasComments: Yes
