Hello David Ribeiro Alves, Mike Percy, Kudu Jenkins,
I'd like you to reexamine a change. Please visit
http://gerrit.cloudera.org:8080/6066
to look at the new patch set (#3).
Change subject: KUDU-1330: Add a tool to unsafely recover from loss of majority
replicas
......................................................................
KUDU-1330: Add a tool to unsafely recover from loss of majority replicas
This patch adds an API to allow unsafe config change via an external
recovery tool 'kudu remote_replica replace_config'.
This tool lets us replace a 3-node config on a tablet server with a
1-node config. This is particularly useful when we have 2 out of 3
replicas down and we want to bring the tablet back to operational state.
We can use this tool to force a new config on the surviving node providing
all the details of the new config from the tool. As a result
of the forced config change, the automatic leader election kicks in via
raft mechanisms and the re-replication is triggered from master to bring
the replica count back upto 3-node config.
The lonely survivor talking to the tool tends to become the leader in
new config in majority of the use cases because:
a) The API/tool mimicks the actual leader the survivor node
had voted for and replicate the new config with a higher term and
bumped up opid_index. This ensures that 2 new nodes added as part of
re-replication respect the term emitted by this node and accept
this node as leader.
b) Assumption is that, the dead nodes are not coming back during this recovery,
hence the leader chosen in step a) will still be the leader when we see
the replication factor restored back to 3.
Also, the ReplaceConfig() API adds a way to abort a pending config change
because pending config comes in the way of recovery tool trying to
replicate/commit the new config on the surviving tablet server. There is only
one pending config change allowed at a time for a given tablet, hence
aborting the pending config change seems safest bet.
This patch is a first in series for unsafe config changes, and assumes that
the dead servers are not coming back while the new config change is taking
effect.
TODO:
0) Accept more replica_uuids from the command line to make support multiple
peers to be added in the new config.
1) Test with a 5-replica config forcing the old {ABCDE} to new {AB} on A.
Change-Id: I908d8c981df74d56dbd034e72001d379fb314700
---
M src/kudu/consensus/consensus.h
M src/kudu/consensus/consensus.proto
M src/kudu/consensus/consensus_queue.cc
M src/kudu/consensus/consensus_queue.h
M src/kudu/consensus/raft_consensus.cc
M src/kudu/consensus/raft_consensus.h
M src/kudu/consensus/time_manager-test.cc
M src/kudu/consensus/time_manager.cc
M src/kudu/integration-tests/cluster_itest_util.cc
M src/kudu/integration-tests/cluster_itest_util.h
M src/kudu/integration-tests/raft_consensus-itest.cc
M src/kudu/tools/kudu-admin-test.cc
M src/kudu/tools/kudu-tool-test.cc
M src/kudu/tools/tool_action_remote_replica.cc
M src/kudu/tserver/tablet_service.cc
15 files changed, 545 insertions(+), 103 deletions(-)
git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/66/6066/3
--
To view, visit http://gerrit.cloudera.org:8080/6066
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I908d8c981df74d56dbd034e72001d379fb314700
Gerrit-PatchSet: 3
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Dinesh Bhat <[email protected]>
Gerrit-Reviewer: David Ribeiro Alves <[email protected]>
Gerrit-Reviewer: Dinesh Bhat <[email protected]>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <[email protected]>
Gerrit-Reviewer: Tidy Bot