Mike Percy has posted comments on this change. Change subject: KUDU-1330: Add a tool to unsafely recover from loss of majority replicas ......................................................................
Patch Set 16: (34 comments) I did one quick nit pass for everything except tests. I caught a couple minor issues, but this rev is mostly asking for changes to comments and log messages. Almost there. http://gerrit.cloudera.org:8080/#/c/6066/16/src/kudu/consensus/consensus.proto File src/kudu/consensus/consensus.proto: Line 494: // The raft config sent to destination server. Add: Only the 'permanent_uuid' of each peer in the config is required (address-related information is ignored by the server). The peers specified in 'new_config' are required to be a subset (or equal to) the peers in the committed config on the destination replica. http://gerrit.cloudera.org:8080/#/c/6066/16/src/kudu/consensus/consensus_queue.cc File src/kudu/consensus/consensus_queue.cc: PS16, Line 553: : nit: add space after colon PS16, Line 554: : and here PS16, Line 777: : nit: space after colon here and elsewhere in this log message http://gerrit.cloudera.org:8080/#/c/6066/16/src/kudu/consensus/raft_consensus.cc File src/kudu/consensus/raft_consensus.cc: PS16, Line 1167: : space after colon here and in the rest of this commit message PS16, Line 1330: nit: Add a period before this space. PS16, Line 1594: Node Replica PS16, Line 1594: so new config will be appended. but the new config will be unsafely changed anyway. PS16, Line 1595: : ": " Line 1596: << state_->GetPendingConfigUnlocked().DebugString(); Add: << ", New config: " << SecureShortDebugString(req.new_config()); PS16, Line 1596: ebugString use SecureShortDebugString() instead PS16, Line 1605: and that node is a VOTER in the config remove this PS16, Line 1606: This is to prevent the API allowing an invalid uuid into the new config. Consider changing this to: This allows a manual recovery tool to only have to specify the uuid of each replica in the new config without having to know the addresses of each server (since we can get the address information from the committed config). Additionally, only a subset of the committed config is required for typical cluster repair scenarios. PS16, Line 1616: for tablet $1 for tablet $1. Committed config: $2 ...and include the committed config in this message using ShortDebugString() PS16, Line 1630: we don't want to enforce unsafe config that way Let's replace this part of the comment with: "it is rare and a replica without itself in the latest config is definitely not caught up with the latest leader's log." PS16, Line 1633: committed new PS16, Line 1635: tablet $1 tablet $1. Rejected config: $2 and pass in SecureShortDebugString(new_config) PS16, Line 1639: preceding_opid.index() + 1 Define replicate_opid_index up here and use it here as well as down below. PS16, Line 1660: leader_term let's rename this variable 'new_term' PS16, Line 1680: DVLOG(1) << "Consensus request: " VLOG_WITH_PREFIX(3) << "UnsafeChangeConfig: Generated consensus request: " Line 1681: DVLOG(1) << "Replicate Msg: " << SecureShortDebugString(*replicate); Remove this line since it's redundant with the line above PS16, Line 1684: , nit: add space after comma PS16, Line 1685: DebugString() SecureShortDebugString() here and below PS16, Line 1685: : space after colon, not before PS16, Line 1686: : here also: space after colon, not before http://gerrit.cloudera.org:8080/#/c/6066/16/src/kudu/consensus/raft_consensus_state.cc File src/kudu/consensus/raft_consensus_state.cc: PS16, Line 241: ( nit: Remove extra pair of parentheses here PS16, Line 245: << Substitute("New committed config must equal pending config, but does not. " : "Pending config: $0, committed config: $1", : SecureShortDebugString(pending_config), : SecureShortDebugString(config_to_commit)); nit: it seems like the indentation is a little messed up on these 4 lines now. Can you re-indent it and fix it to look like it did before? http://gerrit.cloudera.org:8080/#/c/6066/16/src/kudu/tools/tool_action_remote_replica.cc File src/kudu/tools/tool_action_remote_replica.cc: Line 45: #include "kudu/tools/tool_test_util.h" Is this used? It looks like a test utility library? PS16, Line 366: LOG(WARNING) << "NOTE: the new config may be replicated asynchronously " : << "to other peers and it may take a while to bring the tablet " : << "back to majority replica count depending upon how much time " : << "the new peer replicas take to catch up with the tablet server " : << "enforcing the new config."; I think we should remove this message. I think it is confusing because it talks about replication when this tool is a single-node tool. Instead of messages like this, we should just provide good documentation on the web site for people doing cluster troubleshooting that need a manual repair tool. PS16, Line 378: MonoDelta::FromSeconds(30) MonoDelta::FromMilliseconds(FLAGS_timeout_ms) http://gerrit.cloudera.org:8080/#/c/6066/16/src/kudu/tserver/tablet_service.cc File src/kudu/tserver/tablet_service.cc: PS16, Line 885: DVLOG(1) how about: VLOG(2) PS16, Line 909: DVLOG(1) how about: VLOG(2) PS16, Line 910: ChangeConfig UnsafeChangeConfig PS16, Line 923: s Status s -- To view, visit http://gerrit.cloudera.org:8080/6066 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I908d8c981df74d56dbd034e72001d379fb314700 Gerrit-PatchSet: 16 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Dinesh Bhat <[email protected]> Gerrit-Reviewer: David Ribeiro Alves <[email protected]> Gerrit-Reviewer: Dinesh Bhat <[email protected]> Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Mike Percy <[email protected]> Gerrit-Reviewer: Tidy Bot Gerrit-HasComments: Yes
