Adar Dembo has posted comments on this change. Change subject: [tools] Add a 'kudu tablet relocate' tool ......................................................................
Patch Set 4: (22 comments) http://gerrit.cloudera.org:8080/#/c/7444/4//COMMIT_MSG Commit Message: Line 18: tablet bootstraps. As a result of this extra fragility, the tool What sort of guidance should the tool provide w.r.t. the state of the world if relocate_replica is interrupted midway through? PS4, Line 33: Change-Id: : I8684e4826bfeaf36b31d297ec1e49897705867f1 Didn't mean to include this? http://gerrit.cloudera.org:8080/#/c/7444/4/src/kudu/tools/ksck-test.cc File src/kudu/tools/ksck-test.cc: Line 18: #include <iosfwd> Still need this? Line 131: ~KsckTest() {} Can remove this now? http://gerrit.cloudera.org:8080/#/c/7444/4/src/kudu/tools/ksck.cc File src/kudu/tools/ksck.cc: Line 268: // Blocks until either the number of results plus errors reported equals Update with mention of what 'out' does. http://gerrit.cloudera.org:8080/#/c/7444/4/src/kudu/tools/ksck.h File src/kudu/tools/ksck.h: Line 386: Ksck(std::shared_ptr<KsckCluster> cluster, std::ostream& out) Maybe just add this to the existing constructor with a default value of std::cout? Also, shouldn't we store a pointer to the ostream to adhere to our style guidelines? http://gerrit.cloudera.org:8080/#/c/7444/4/src/kudu/tools/ksck_remote-test.cc File src/kudu/tools/ksck_remote-test.cc: Line 19: #include <sstream> Don't need this? Line 61: ~RemoteKsckTest() {} Can omit? http://gerrit.cloudera.org:8080/#/c/7444/4/src/kudu/tools/kudu-admin-test.cc File src/kudu/tools/kudu-admin-test.cc: PS4, Line 189: TabletServerMap::const_iterator Can use auto type here? PS4, Line 206: workload.set_timeout_allowed(true); : workload.set_write_timeout_millis(10000); : workload.set_num_replicas(FLAGS_num_replicas); : workload.set_num_write_threads(1); : workload.set_write_batch_size(1); For the non-obvious non-default values (basically for everything except num_replicas) please either remove them or add a comment explaining why they're necessary/useful for this test. Line 215: int num_moves = AllowSlowTests() ? 3 : 1; How long does the test take to run when num_moves=1? When num_moves=3? PS4, Line 236: ClusterVerifier v(cluster_.get()); : NO_FATALS(v.CheckCluster()); Would it be useful to run this after every move rather than just at the end? Or does it not matter? http://gerrit.cloudera.org:8080/#/c/7444/4/src/kudu/tools/tool_action_tablet.cc File src/kudu/tools/tool_action_tablet.cc: Line 45: Do you think it'd be more useful to express these timeouts in terms of seconds, since both operations are expected to take a non-trivial amount of time? Line 46: DEFINE_int64(relocate_copy_timeout_ms, 30000, 30s doesn't seem like enough time for a tablet copy when tablets may be gigabytes in size. Any reason for this not to be a very high value, in the multiple minutes range? Line 123: std::shared_ptr<KsckMaster> master; Let's add using std::shared_ptr. Line 130: std::shared_ptr<Ksck> ksck(new Ksck(cluster, null_stream)); Why do we need to allocate Ksck on the heap and store it in a shared_ptr? Can't we just stack allocate it? Line 334: while (MonoTime::Now() < deadline) { I think it'd be nice to reuse proxies in these loops, to avoid the overhead of repeated DNS lookups. Maybe in WaitForCleanKsck too. Line 356: SleepFor(MonoDelta::FromMilliseconds(500)); Why not 1000 ms here? Line 366: const string& rem_replica_uuid = FindOrDie(context.required_args, kRemReplicaUuidArg); See below: "replica UUID" is a misnomer since this UUID uniquely identifies the tserver, not a particular replica on the tserver. PS4, Line 387: RETURN_NOT_OK(ChangeLeader(client, tablet_id, : leader_uuid, leader_hp, : MonoDelta::FromMilliseconds(FLAGS_relocate_leader_timeout_ms))); Why is this step necessary? Won't a new leader automatically be elected if the replica removed is the old leader? Line 434: .AddRequiredParameter({ kRemReplicaUuidArg, "UUID of the replica to be removed" }) These are server UUIDs though, not tablet UUIDs, right? Line 451: .AddAction(std::move(relocate_replica)) Nit: should come before remove_replica (alphabetical sort). -- To view, visit http://gerrit.cloudera.org:8080/7444 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I3b7a7243333ba6e6a3d6fce96b220224d6e38a84 Gerrit-PatchSet: 4 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Will Berkeley <[email protected]> Gerrit-Reviewer: Adar Dembo <[email protected]> Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Mike Percy <[email protected]> Gerrit-Reviewer: Tidy Bot Gerrit-Reviewer: Todd Lipcon <[email protected]> Gerrit-Reviewer: Will Berkeley <[email protected]> Gerrit-HasComments: Yes
