Dinesh Bhat has posted comments on this change.

Change subject: KUDU-1330: Add a tool to unsafely recover from loss of majority 
replicas
......................................................................


Patch Set 7:

(27 comments)

http://gerrit.cloudera.org:8080/#/c/6066/7/src/kudu/consensus/consensus.proto
File src/kudu/consensus/consensus.proto:

PS7, Line 488: 4
> You can just number this starting from 1
Done.


PS7, Line 492: sender_id
> How about caller_uuid or caller_id
Done


PS7, Line 497: required RaftPeerPB server
> Comment needs update?
Done


PS7, Line 506: new_config
> Is this used for something? Regardless, see my comment above, it seems to m
Done


http://gerrit.cloudera.org:8080/#/c/6066/7/src/kudu/consensus/metadata.proto
File src/kudu/consensus/metadata.proto:

Line 93:   optional bool allow_unsafe = 4;
> I think this belongs in ChangeConfigRecordPB
I guess you changed your mind about this suggestion later if I am correct.


http://gerrit.cloudera.org:8080/#/c/6066/7/src/kudu/consensus/raft_consensus.cc
File src/kudu/consensus/raft_consensus.cc:

Line 1148:     if (state_->GetActiveRoleUnlocked() != RaftPeerPB::LEADER) {
> This change can be reverted, right?
yeah, I was about to, missed in previous patch. Done.


PS7, Line 1328: .
> nit: missing space after period
Done


PS7, Line 1328: UNLOCKED
> why _UNLOCKED?
Done, yeah that was not intentional.


Line 1563:                                          
UnsafeChangeConfigResponsePB *resp,
> warning: parameter 'resp' is unused [misc-unused-parameters]
Done


Line 1598:   string replica_uuid = peer_pb.permanent_uuid();
> warning: the variable 'replica_uuid' is copy-constructed from a const refer
Done


Line 1598:   string replica_uuid = peer_pb.permanent_uuid();
> Based on my comments elsewhere I think this removal logic should probably h
I believe there are substantial changes here based on our offline discussions, 
so please re-review this portion.


Line 1630:   new_config.add_peers()->CopyFrom(new_peer);
> So... we are only allowing changing to a single-node config?
That has changed now, the CLI/API takes the new config itself.


PS7, Line 1651: current_term + 1
> please extract this into a new variable for use here and on line 1664
Done


PS7, Line 1665: preceding_opid.index() + 1
> Please extract a variable for this so it doesn't look like a magic number. 
Done


PS7, Line 1667: msg_timestamp
> Don't we need this to be higher
higher than ? can you explain bit more what you have in mind ?


Line 1672:   LOG_WITH_PREFIX(WARNING)
> I don't think it's necessary for all caps in the server log. If you want to
Done


Line 1680:     s = StatusFromPB(consensus_resp.error().status());
> Is there a specific error type you are looking for here?
Oh yeah, definitely in my radar, updated the commit_msg accordingly.


http://gerrit.cloudera.org:8080/#/c/6066/8/src/kudu/consensus/raft_consensus.cc
File src/kudu/consensus/raft_consensus.cc:

PS8, Line 1612:   // Take the snapshot of the queue state and timestamp to 
stick them
              :   // in the consensus request later.
              :   int64_t all_replicated_index = 
queue_->GetAllReplicatedIndex();
              :   int64 last_committed_index = queue_->GetCommittedIndex();
              :   OpId preceding_opid = queue_->GetLastOpIdInLog();
              :   uint64 msg_timestamp = time_manager_->GetSafeTime().value();
> needs to be done under the lock.
David, sorry still not following. Did you mean lock for timestamp ?


PS8, Line 1672: LOG_WITH_PREFIX(WARNING)
              :         << "REPLACING THE CONFIG ON THIS SERVER WITH A NEW 
CONFIG,"
              :         << "THIS OPERATION IS IRREVERSIBLE !!\n"
              :         << "COMMITTED CONFIG :" << 
committed_config.DebugString()
              :         << "NEW CONFIG :" << new_config.DebugString();
> add one like this with the result of the actual change config.
Actually, the fact that we have written this new_config doesn't mean that 
cluster came back up and tablet became operational. Tool throws a warning today 
quoting this operation is asynchronous and lot can happen in the background, 
longer wait, etc. As such, there is nothing else to print here apart from 
config being overwritten(new_config). Or did you have something else in mind ?


http://gerrit.cloudera.org:8080/#/c/6066/7/src/kudu/consensus/raft_consensus_state.cc
File src/kudu/consensus/raft_consensus_state.cc:

PS7, Line 226: committed_config
> As we discussed offline, it would be an improvement to rename this argument
Done


Line 238:   if (!committed_config.allow_unsafe()) {
> This doesn't seem right to me. I think this check should only be skipped if
Actually, there is a chicken-and-egg kinda issue here; I have tried to make 
this clear in comments now. We should check for equality of config only when we 
know both pending and to-be-committed config match wrt their 'unsafe' flag. Let 
me know if you want the comments to be more clearer.


http://gerrit.cloudera.org:8080/#/c/6066/7/src/kudu/integration-tests/cluster_itest_util.h
File src/kudu/integration-tests/cluster_itest_util.h:

Line 167: Status WaitForReplicasReportedToMaster(
> Doesn't ExternalMiniCluster::WaitForTabletServerCount() already implement t
It looks like signatures are different. Besides, this was a consolidation; 
i.e., moved the definition of this from a different source file to here. If 
there is duplication I will follow up a different patch to remove this.


http://gerrit.cloudera.org:8080/#/c/6066/7/src/kudu/tools/kudu-admin-test.cc
File src/kudu/tools/kudu-admin-test.cc:

Line 159: TEST_F(AdminCliTest, TestReplaceConfigOnSingleFollower) {
> Please add test comments with a brief summary of the test.
I will do this in an upcoming patch, I am still adding more tests and I may 
remove few redundant ones.


PS7, Line 172: find
> You need to use equal_range() here: https://stackoverflow.com/questions/904
good find, thanks, done.


Line 224: 
> Maybe shut down the master, first wait until # committed voters is 1, then 
Because #3->#1->#3 happens asynchronously, we may miss the transition to #1 by 
the we start waiting. I thought existing checks are less flaky enough.


PS7, Line 255: timeout
> How about kTimeout here and elsewhere, also I think at least 30 seconds is 
Done


Line 348: TEST_F(AdminCliTest, TestReplaceConfigOnLeaderWithPendingConfig) {
> Can we add a version for a follower with a pending config?
Actually, we can not send in a ChangeConfig API explicitly to a follower 
because it's a leader API. Follower fails with:

Bad status: Illegal state: Replica c7075ee3081d4f2fad16dcaec82479c8 is not 
leader of this config. Role: FOLLOWER. 

I haven't really looked more, but I believe there may be another way of testing 
this.


-- 
To view, visit http://gerrit.cloudera.org:8080/6066
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I908d8c981df74d56dbd034e72001d379fb314700
Gerrit-PatchSet: 7
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Dinesh Bhat <[email protected]>
Gerrit-Reviewer: David Ribeiro Alves <[email protected]>
Gerrit-Reviewer: Dinesh Bhat <[email protected]>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <[email protected]>
Gerrit-Reviewer: Tidy Bot
Gerrit-HasComments: Yes

Reply via email to