[kudu-CR] [tools] Add a 'kudu tablet relocate' tool

Will Berkeley (Code Review) Sat, 22 Jul 2017 20:43:26 -0700

Will Berkeley has posted comments on this change.

Change subject: [tools] Add a 'kudu tablet relocate' tool
......................................................................



Patch Set 4:

(22 comments)

http://gerrit.cloudera.org:8080/#/c/7444/4//COMMIT_MSG
Commit Message:

Line 18: tablet bootstraps. As a result of this extra fragility, the tool
> What sort of guidance should the tool provide w.r.t. the state of the world
I added some information about this in an extra description for the action. It 
doesn't cover cases where the tablet somehow gets stuck due to exogenous 
factors, but it covers the most common cases where one shouldn't just re-run 
the tool: when the tool fails but the new server gets successfully added 
eventually, and when the tools fails and the new server is not added because 
the copy fails.


PS4, Line 33: Change-Id:
            : I8684e4826bfeaf36b31d297ec1e49897705867f1
> Didn't mean to include this?
Done


http://gerrit.cloudera.org:8080/#/c/7444/4/src/kudu/tools/ksck-test.cc
File src/kudu/tools/ksck-test.cc:

Line 18: #include <iosfwd>
> Still need this?
Done


Line 131:   ~KsckTest() {}
> Can remove this now?
Done


http://gerrit.cloudera.org:8080/#/c/7444/4/src/kudu/tools/ksck.cc
File src/kudu/tools/ksck.cc:

Line 268:   // Blocks until either the number of results plus errors reported 
equals
> Update with mention of what 'out' does.
Done


http://gerrit.cloudera.org:8080/#/c/7444/4/src/kudu/tools/ksck.h
File src/kudu/tools/ksck.h:

Line 386:   Ksck(std::shared_ptr<KsckCluster> cluster, std::ostream& out)
> Maybe just add this to the existing constructor with a default value of std
Done


http://gerrit.cloudera.org:8080/#/c/7444/4/src/kudu/tools/ksck_remote-test.cc
File src/kudu/tools/ksck_remote-test.cc:

Line 19: #include <sstream>
> Don't need this?
Done


Line 61:   ~RemoteKsckTest() {}
> Can omit?
Done


http://gerrit.cloudera.org:8080/#/c/7444/4/src/kudu/tools/kudu-admin-test.cc
File src/kudu/tools/kudu-admin-test.cc:

PS4, Line 189: TabletServerMap::const_iterator
> Can use auto type here?
Done


PS4, Line 206:   workload.set_timeout_allowed(true);
             :   workload.set_write_timeout_millis(10000);
             :   workload.set_num_replicas(FLAGS_num_replicas);
             :   workload.set_num_write_threads(1);
             :   workload.set_write_batch_size(1);
> For the non-obvious non-default values (basically for everything except num
Done


Line 215:   int num_moves = AllowSlowTests() ? 3 : 1;
> How long does the test take to run when num_moves=1? When num_moves=3?
num_moves = 1 about 3 seconds though can vary a bit; with num_moves = 3 it's 10 
or so.


PS4, Line 236:   ClusterVerifier v(cluster_.get());
             :   NO_FATALS(v.CheckCluster());
> Would it be useful to run this after every move rather than just at the end
CheckCluster() is basically a checksum ksck plus a check of the opids. We're 
running ksck as part of the tool action and we have to wait in the loop for the 
opids to converge so we know that the new server has applied the new config and 
ksck will not see a consensus conflict, so I think we effectively run it as 
part of the loop :)


http://gerrit.cloudera.org:8080/#/c/7444/4/src/kudu/tools/tool_action_tablet.cc
File src/kudu/tools/tool_action_tablet.cc:

Line 45: 
> Do you think it'd be more useful to express these timeouts in terms of seco
Done


Line 46: DEFINE_int64(relocate_copy_timeout_ms, 30000,
> 30s doesn't seem like enough time for a tablet copy when tablets may be gig
I upped it to ten minutes.


Line 123:   std::shared_ptr<KsckMaster> master;
> Let's add using std::shared_ptr.
Done


Line 130:   std::shared_ptr<Ksck> ksck(new Ksck(cluster, null_stream));
> Why do we need to allocate Ksck on the heap and store it in a shared_ptr? C
Done


Line 334:   while (MonoTime::Now() < deadline) {
> I think it'd be nice to reuse proxies in these loops, to avoid the overhead
Done


Line 356:     SleepFor(MonoDelta::FromMilliseconds(500));
> Why not 1000 ms here?
Just hoping to be a bit faster since replicating an op should be faster than 
holding an election, and asking for the current opid from one server is less 
onerous than running ksck.


Line 366:   const string& rem_replica_uuid = FindOrDie(context.required_args, 
kRemReplicaUuidArg);
> See below: "replica UUID" is a misnomer since this UUID uniquely identifies
You are technically correct (the best kind of correct), but the name makes 
sense to me in context because the UUID identifies the replica uniquely within 
the tablet. However, a user might not have the same mental context and "tserver 
uuid" would likely be clearer for them. I'd like to hear what Mike thinks about 
renaming "replica UUID" -> "server UUID" or "tablet server UUID".


PS4, Line 387:     RETURN_NOT_OK(ChangeLeader(client, tablet_id,
             :                                leader_uuid, leader_hp,
             :                                
MonoDelta::FromMilliseconds(FLAGS_relocate_leader_timeout_ms)));
> Why is this step necessary? Won't a new leader automatically be elected if 
It's disallowed to remove a leader from a configuration. The leader needs to be 
told to step down so another server can become leader and execute the config 
change kicking the old leader out.

This is a restriction of our current implementation, not Raft itself. If/when 
we implement leadership transfer, we could have the leader step down gracefully 
as part of a config change removing it from the config.


Line 434:       .AddRequiredParameter({ kRemReplicaUuidArg, "UUID of the 
replica to be removed" })
> These are server UUIDs though, not tablet UUIDs, right?
See above comment.


Line 451:       .AddAction(std::move(relocate_replica))
> Nit: should come before remove_replica (alphabetical sort).
Done


-- 
To view, visit http://gerrit.cloudera.org:8080/7444
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I3b7a7243333ba6e6a3d6fce96b220224d6e38a84
Gerrit-PatchSet: 4
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Will Berkeley <[email protected]>
Gerrit-Reviewer: Adar Dembo <[email protected]>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <[email protected]>
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <[email protected]>
Gerrit-Reviewer: Will Berkeley <[email protected]>
Gerrit-HasComments: Yes

[kudu-CR] [tools] Add a 'kudu tablet relocate' tool

Reply via email to