[kudu-CR] [tools] Manual recovery tools (part 1)

Adar Dembo (Code Review) Tue, 25 Oct 2016 12:22:11 -0700

Adar Dembo has posted comments on this change.

Change subject: [tools] Manual recovery tools (part 1)
......................................................................



Patch Set 1:

(32 comments)

http://gerrit.cloudera.org:8080/#/c/4834/1/src/kudu/integration-tests/cluster_itest_util.cc
File src/kudu/integration-tests/cluster_itest_util.cc:

PS1, Line 224:   return Status::TimedOut(Substitute("Index $0 not available on 
all replicas after $1. "
             :                                               "Replicas: [ $2 ]",
             :                                               log_index, 
passed.ToString(), replicas_str));
Good catch. Could you reindent the continuation lines though?


http://gerrit.cloudera.org:8080/#/c/4834/1/src/kudu/tools/kudu-tool-test.cc
File src/kudu/tools/kudu-tool-test.cc:

PS1, Line 172: RunActionStderrString
Nit: RunActionStdoutStderrString (since it's returning both).


Line 215:  protected:
Why this change?


Line 216:   string tool_path_;
Remember to order functions first, then fields.


PS1, Line 225:   gscoped_ptr<ExternalMiniCluster> cluster_;
             :   gscoped_ptr<ExternalMiniClusterFsInspector> inspect_;
             :   unordered_map<string, TServerDetails*> ts_map_;
             :   gscoped_ptr<MiniCluster> mini_cluster_;
Can we use unique_ptrs here instead?


Line 842:   NO_FATALS(StartExternalMiniCluster({}, {"--never_fsync"}, 
num_tservers));
Okay but we lost the "// fsync causes ..." comment.


Line 937: // Test 'kudu tablet copy' tool when the destination tablet server is 
online.
Mike should definitely review this new test; I'm not familiar with the tablet 
deletion logic the way he is, and so I don't know how to assess the various 
WaitForFoo() calls.


Line 941: TEST_F(ToolTest, TestTabletCopy) {
In the various RunActionFoo() calls, you're not using the resulting 
stdout/stderr; can you replace them with no-arg calls?


Line 944:   const int kSrcTsIndex = 1, kDstTsIndex = 2, kAddOnTsIndex = 3;
Nit: declare each of these on its own line.


Line 947:       {}, {"--follower_unavailable_considered_failed_sec=3"}, 4));
Nit: should also explain (in a comment) why we need 4 tservers.


PS1, Line 951: to add this add-on tserver for all the tablets which are 
under-replicated.
Nit: to replicate all under-replicated tablets to the add-on tserver.


PS1, Line 974: consistent rows
What does "consistent" mean in this context? And how does waiting for the 
client to write at least 1000 rows guarantee it?


PS1, Line 984:  ;  i++) {
Nit: there's extra space here, both before and after the semicolon.


Line 985:     tservers.push_back(ts_map_[cluster_->tablet_server(i)->uuid()]);
Maybe loop from i to num_tservers, then skip when i == kAddOnTsIndex? That'll 
be robust to changes in kAddOnTsIndex; right now this loop assumes that 
kAddOnTsIndex is the last index.


Line 996:   string stdout, stderr;
Nit: declare on separate lines.


PS1, Line 1018: "local_replica delete $0 --fs_wal_dir=$1 "
              :                                              "--fs_data_dirs=$2
See my other comment about substitution and fs_wal_dir/fs_data_dirs.


PS1, Line 1033: auto
Can this be const auto?


Line 1067:   NO_FATALS(StartMiniCluster());
Why not use an external mini cluster here too, and avoid the need for code that 
can start either kind? Can we get by with that?

If the need is to force a flush on the particular tablet, you can set 
flush_threshold_mb very low, insert rows in a loop, and periodically look at a 
maintenance manager metric to see if a flush happened.


Line 1079:   // Grab the tablet's local replica to delete
You mean grab its tablet_id, right?


PS1, Line 1084:   TabletServerServiceProxy* ts_proxy = new 
TabletServerServiceProxy(
              :       ts->server()->messenger(), ts->bound_rpc_addr());
Do we have to allocate this from the heap? Can we just declare it on the stack?


PS1, Line 1088:   vector<ListTabletsResponsePB::StatusAndSchemaPB> tablets;
              :   tablets.assign(resp.status_and_schema().begin(), 
resp.status_and_schema().end());
              :   const string& tablet_id = 
tablets.at(0).tablet_status().tablet_id();
Why do we need the indirection of 'tablets'? Can't we reach into 
resp.status_and_schema() directly to get the first tablet? It's the only tablet 
right? We should ASSERT that.


Line 1107:   const string& data_dir = JoinPathSegments(tserver_dir, "data");
Why must we consider the data subdirectory specifically? Can't we just compare 
the size of tserver_dir before and after?


PS1, Line 1111: "local_replica delete $0 --fs_wal_dir=$1 "
              :                                               
"--fs_data_dirs=$2",
You can omit --fs_data_dirs. Or you can reuse $1, then pass tserver_dir just 
once into the substitution.


PS1, Line 1113: stdout
stdout is never used; can you call the variant that doesn't provide any output?


http://gerrit.cloudera.org:8080/#/c/4834/1/src/kudu/tools/tool_action_local_replica.cc
File src/kudu/tools/tool_action_local_replica.cc:

PS1, Line 276:   RETURN_NOT_OK(TSTabletManager::DeleteTabletData(
             :       meta, TabletDataState::TABLET_DATA_DELETED, boost::none));
Looks reasonable to me, but can you make sure Mike looks at this? I'm 
specifically wondering whether 1) we should tombstone the replica, and 2) 
whether we need to provide a last_logged_opid.


http://gerrit.cloudera.org:8080/#/c/4834/1/src/kudu/tools/tool_action_tablet.cc
File src/kudu/tools/tool_action_tablet.cc:

PS1, Line 215:   const string& src_address = FindOrDie(context.required_args, 
"src_tserver_address");
             :   const string& dst_address = FindOrDie(context.required_args, 
"dst_tserver_address");
Define the two strings as static constants earlier in the file, then use those 
here and in BuildTabletMode().


Line 232:   rpc.set_timeout(MonoDelta::FromSeconds(30)); // Should this be 
higher ?
If you don't set a timeout at all, won't the request never time out? Then you 
won't need to LOG that warning below either.


PS1, Line 247: StartTabletCopy
As per KUDU-921, we'll modify the StartTabletCopy() implementation in the 
future so that it doesn't block on the entire copy procedure finishing. We 
should adjust the name/description of this action accordingly, to imply that it 
merely starts the copy operation, but doesn't wait for it to finish.

If you think it's important for it to wait on the copy to finish, we'll need to 
add some way to ask a tserver about its outstanding tablet copies.


PS1, Line 300:       .AddRequiredParameter({ "src_tserver_address", "Source 
tablet server to copy the tablet replica from" })
             :       .AddRequiredParameter({ "dst_tserver_address", 
"Destination tablet server to copy the tablet replica" })
How about just "src_address" and "dst_address", since the descriptions do a 
good job of explaining that these are tserver addresses?

Also, try to use the same description as is used in tool_action_tserver.cc's 
kTServerAddressDesc.


Line 315:       .AddAction(std::move(copy_tablet))
I think the remote_replica mode would be a more appropriate home, since this 
operation has to do with a single replica rather than an entire tablet.


http://gerrit.cloudera.org:8080/#/c/4834/1/src/kudu/tserver/ts_tablet_manager.cc
File src/kudu/tserver/ts_tablet_manager.cc:

Line 916:   LOG(INFO) << "T " + tablet_id + " P " + meta->fs_manager()->uuid() 
+ ": "
Can you make a static two-arg variant of LogPrefix() and chain the non-static 
variant into it? That way the actual logging format (i.e. "T " tablet_id ...") 
is still described in just one place.


http://gerrit.cloudera.org:8080/#/c/4834/1/src/kudu/tserver/ts_tablet_manager.h
File src/kudu/tserver/ts_tablet_manager.h:

PS1, Line 180:   // This allows the tablet cleanup even when the tablet server 
is offline.
Nit: I don't think this new sentence is necessary; we generally don't use a 
function comment to justify the existence of the function.


-- 
To view, visit http://gerrit.cloudera.org:8080/4834
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I113a25e9b6c14f7c3814140917b61e35030b58d0
Gerrit-PatchSet: 1
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Dinesh Bhat <[email protected]>
Gerrit-Reviewer: Adar Dembo <[email protected]>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <[email protected]>
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <[email protected]>
Gerrit-HasComments: Yes

[kudu-CR] [tools] Manual recovery tools (part 1)

Reply via email to