Mike Percy has posted comments on this change.

Change subject: KUDU-1330: Add a tool to unsafely recover from loss of majority 
replicas
......................................................................


Patch Set 11:

(20 comments)

http://gerrit.cloudera.org:8080/#/c/6066/11//COMMIT_MSG
Commit Message:

Line 41: 
A description of the test cases you have implemented would be nice.


http://gerrit.cloudera.org:8080/#/c/6066/11/src/kudu/consensus/consensus.proto
File src/kudu/consensus/consensus.proto:

Line 495:   required RaftConfigPB new_config = 4;
Can we also add a optional int64 cas_config_opid_index like in 
ChangeConfigRequestPB?


PS11, Line 499: otherwise 'new_configuration' is set
Update comment


PS11, Line 522: Unsafe
nit: Can we change this to Force instead of Unsafe?


http://gerrit.cloudera.org:8080/#/c/6066/11/src/kudu/consensus/metadata.proto
File src/kudu/consensus/metadata.proto:

PS11, Line 93: allow_unsafe
How about forced_config_change?


http://gerrit.cloudera.org:8080/#/c/6066/7/src/kudu/consensus/raft_consensus.cc
File src/kudu/consensus/raft_consensus.cc:

PS7, Line 1667: pid_index);
> higher than ? can you explain bit more what you have in mind ?
You are using safetime which is in the past. Don't we want the timestamp to be 
Now()? I would like David to take a look at this.


http://gerrit.cloudera.org:8080/#/c/6066/11/src/kudu/consensus/raft_consensus.cc
File src/kudu/consensus/raft_consensus.cc:

Line 1592:   // Check that passed replica uuids are part of the committed 
config on this node.
How about validating that the current node is a voter in the config?


Line 1601:           new_peer.last_known_addr().host() == 
committed_peer.last_known_addr().host() &&
Do we expect last_known_addr to always be passed in the new config? I thought 
we only required uuid. See below comment.


Line 1623:   RaftConfigPB new_config = config;
How about just use a copy of the committed config, but with the uuids we want 
to remove removed?

Something like:

set<string> retained_uuids;
for (const auto& peer : config.peers()) {
  const string& uuid = peer.permanent_uuid();
  if (!IsRaftConfigMember(uuid, committed_config)) {
    return Status::InvalidArgument(...);
  }
  InsertOrDie(&retained_uuids, uuid);
}
RaftConfigPB new_config = committed_config;
for (const auto& peer : committed_config) {
  const string& uuid = peer.permanent_uuid();
  if (ContainsKey(retained_uuids, uuid)) continue;
  CHECK(RemoveFromRaftConfig(&new_config, uuid));
}


http://gerrit.cloudera.org:8080/#/c/6066/7/src/kudu/consensus/raft_consensus_state.cc
File src/kudu/consensus/raft_consensus_state.cc:

Line 238:   // In the event of unsafe config change enforced from an external 
tool,
> Actually, there is a chicken-and-egg kinda issue here; I have tried to make
I don't understand that logic. If there are multiple "pending" configs, which 
this condition implies since we are committing something, then the latest 
"pending" config must be forced. Right?


http://gerrit.cloudera.org:8080/#/c/6066/11/src/kudu/consensus/raft_consensus_state.cc
File src/kudu/consensus/raft_consensus_state.cc:

PS11, Line 212: Allowing config update
How about: Allowing forced config change


PS11, Line 239: may not have a pending config
You mean that it may not match the currently-pending config, right?


PS11, Line 242: both
I don't think this will work in all cases; see below


PS11, Line 247: !config_to_commit.allow_unsafe()
As mentioned in the rev 7 thread, I don't think this condition is needed


PS11, Line 248: pending_config.allow_unsafe() && config_to_commit.allow_unsafe()
I think this should be removed due to the following scenario:

    remote_replica unsafe_change_config a.b.c.d foo A B C D E
    remote_replica unsafe_change_config a.b.c.d foo A B C
    remote_replica unsafe_change_config a.b.c.d foo A

The first 2 could still be pending and would need to be committed before the 
first one can be committed.


http://gerrit.cloudera.org:8080/#/c/6066/7/src/kudu/tools/kudu-admin-test.cc
File src/kudu/tools/kudu-admin-test.cc:

Line 224:     tablet_id_,
> Because #3->#1->#3 happens asynchronously, we may miss the transition to #1
The master orchestrates the part from #1 -> #2 -> #3, so if the master is dead 
we can wait for #1 to occur.


Line 348:                                                    3, tablet_id_, 
kTimeout,
> Actually, we can not send in a ChangeConfig API explicitly to a follower be
You can force a leader to step down to be a follower, while it has a pending 
config change, before running the rest of the test. The config change will 
remain pending.


http://gerrit.cloudera.org:8080/#/c/6066/11/src/kudu/tools/kudu-admin-test.cc
File src/kudu/tools/kudu-admin-test.cc:

PS11, Line 665: fault
I like the idea of using a fault to trigger committing config changes during 
tablet bootstrap.

Can we additionally test this scenario during normal operation (outside of 
bootstrap)? We could do something like this (in pseudocode):

1. Create 5-node cluster {ABCDE}. Shut down 4 replicas {BCDE}.
2. Successively apply forced config changes to node A:

    vector<string> replica_uuids = {"A", "B", "C", "D", "E"};
    for (int num_replicas = 4; num_replicas > 0;     num_replicas--) {
      vector<string> uuids_to_include;
      for (int i = 0; i < num_replicas; i++) {
        uuids_to_include.push_back(replica_uuids[i]);
      }
      ChangeConfig(ts_A, uuids_to_include);
    }


http://gerrit.cloudera.org:8080/#/c/6066/11/src/kudu/tools/tool_action_common.h
File src/kudu/tools/tool_action_common.h:

Line 95: Status ParseHostPortString(const std::string& hostport_str,
Does this handle default ports?

Do we even need this since we are only removing peers?


http://gerrit.cloudera.org:8080/#/c/6066/11/src/kudu/tools/tool_action_remote_replica.cc
File src/kudu/tools/tool_action_remote_replica.cc:

Line 368:     HostPortPB new_peer_host_port_pb;
I thought we said users don't have to pass the address of the server?


-- 
To view, visit http://gerrit.cloudera.org:8080/6066
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I908d8c981df74d56dbd034e72001d379fb314700
Gerrit-PatchSet: 11
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Dinesh Bhat <[email protected]>
Gerrit-Reviewer: David Ribeiro Alves <[email protected]>
Gerrit-Reviewer: Dinesh Bhat <[email protected]>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <[email protected]>
Gerrit-Reviewer: Tidy Bot
Gerrit-HasComments: Yes

Reply via email to