Will Berkeley has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/11251 )

Change subject: KUDU-2245 Graceful leadership transfer
......................................................................


Patch Set 10:

(9 comments)

I tried doing a bit of a refactor on the CheckMoveComplete code but maybe I do 
not understand it as well as I thought. Maybe we should chat about it tomorrow?

http://gerrit.cloudera.org:8080/#/c/11251/10/src/kudu/tools/tool_replica_util.cc
File src/kudu/tools/tool_replica_util.cc:

http://gerrit.cloudera.org:8080/#/c/11251/10/src/kudu/tools/tool_replica_util.cc@a198
PS10, Line 198:
0. 'from_ts_uuid' is X (in the failure logs, X = 
51a2892fa5c1471fb173f2dda0d7d11a.


http://gerrit.cloudera.org:8080/#/c/11251/10/src/kudu/tools/tool_replica_util.cc@a206
PS10, Line 206:
              :
              :
              :
              :
1. We find out the tablet leader and build a proxy to it.

In the failure case, 'from_ts_uuid' equals 'leader_uuid' equals X. 'proxy' is 
an RPC proxy to X.


http://gerrit.cloudera.org:8080/#/c/11251/10/src/kudu/tools/tool_replica_util.cc@a259
PS10, Line 259:
              :
              :
              :
              :
              :
              :
3. We do a graceful stepdown of TS with UUID 'leader_uuid', which is X. Note 
that 'proxy' continues to point to X.


http://gerrit.cloudera.org:8080/#/c/11251/10/src/kudu/tools/tool_replica_util.cc@a282
PS10, Line 282:
              :
              :
4. We re-find the current leader and store it in 'leader_uuid', shadowing 
'leader_uuid'. If leadership has already changed and client detects this, 
'leader_uuid' is not X anymore. It's now, say, Y. Note that 'proxy' is still a 
proxy to X.

In the failure logs, Y = d56040e17c444e06b1924246d25f1263.


http://gerrit.cloudera.org:8080/#/c/11251/10/src/kudu/tools/tool_replica_util.cc@a285
PS10, Line 285:
5. This evaluates to true, since 'from_ts_uuid' is X, the original leader, but 
now 'leader_uuid' is Y.


http://gerrit.cloudera.org:8080/#/c/11251/10/src/kudu/tools/tool_replica_util.cc@a290
PS10, Line 290:
6. Uh oh. 'proxy' is a proxy to X, but 'leader_uuid', which determines the 
destination UUID of the GetConsensus RPC, is Y. Thus the RPC fails with

    Invalid argument: GetConsensusState: Wrong destination UUID requested. 
Local UUID: X. Requested UUID: Y


http://gerrit.cloudera.org:8080/#/c/11251/10/src/kudu/tools/tool_replica_util.cc@296
PS10, Line 296:     // We should consult the leader for the most up-to-date 
consensus state.
              :     string new_leader_uuid;
              :     HostPort new_leader_hp;
              :     RETURN_NOT_OK(GetTabletLeader(client, tablet_id, 
&new_leader_uuid, &new_leader_hp));
> I think this is racy, regardless of the way how leadership changes -- grace
Right, we rely on retries by the caller to handle that sort of thing. 313 is 
downstream of this leader determination and can't replace it, as I understand 
the code. We have to find the new leader else we won't know if we've 
successfully moved leadership from 'from_ts_uuid' (in the case where that's 
necessary).


http://gerrit.cloudera.org:8080/#/c/11251/10/src/kudu/tools/tool_replica_util.cc@305
PS10, Line 305: leader_uuid
> If leader changed and new leader UUID is detected, why still to keep this o
This is a mistake. We want to check if the current leader's uuid after step 
down is the same as the 'from_ts_uuid'. Thanks for finding it.


http://gerrit.cloudera.org:8080/#/c/11251/10/src/kudu/tools/tool_replica_util.cc@310
PS10, Line 310: proxy
> I think this code is executed when no DoLeaderStepDown() was called above (
Maybe some naming cleanup  or a bit of refactoring would help? 'from_ts_uuid != 
leader_uuid' on L305 is saying that (in the case when the leadership started on 
'from_ts_uuid') leadership has changed to some other peer.

I added some numbered comments to the original code to better explain what I 
think is going on, and I'll make an attempt to make it clearer what's going on 
here, at least as I understand it.



--
To view, visit http://gerrit.cloudera.org:8080/11251
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ic97343af9eb349556424c999799ed5e2941f0083
Gerrit-Change-Number: 11251
Gerrit-PatchSet: 10
Gerrit-Owner: Will Berkeley <[email protected]>
Gerrit-Reviewer: Alexey Serbin <[email protected]>
Gerrit-Reviewer: Fengling Wang <[email protected]>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <[email protected]>
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Will Berkeley <[email protected]>
Gerrit-Comment-Date: Mon, 24 Sep 2018 08:18:42 +0000
Gerrit-HasComments: Yes

Reply via email to