Dinesh Bhat has posted comments on this change.

Change subject: [tools] Manual recovery tools (part 1)
......................................................................


Patch Set 1:

(38 comments)

http://gerrit.cloudera.org:8080/#/c/4834/1/src/kudu/integration-tests/cluster_itest_util.cc
File src/kudu/integration-tests/cluster_itest_util.cc:

PS1, Line 224:   return Status::TimedOut(Substitute("Index $0 not available on 
all replicas after $1. "
             :                                               "Replicas: [ $2 ]",
             :                                               log_index, 
passed.ToString(), replicas_str));
> Good catch. Could you reindent the continuation lines though?
Done


http://gerrit.cloudera.org:8080/#/c/4834/1/src/kudu/tools/kudu-tool-test.cc
File src/kudu/tools/kudu-tool-test.cc:

Line 56: #include "kudu/tablet/tablet.h"
> warning: #includes are not sorted properly [llvm-include-order]
Done


Line 84: using consensus::ConsensusConfigType;
> warning: using decl 'ConsensusConfigType' is unused [misc-unused-using-decl
Done


Line 85: using consensus::ConsensusStatePB;
> warning: using decl 'ConsensusStatePB' is unused [misc-unused-using-decls]
Done


Line 87: using consensus::RaftPeerPB;
> warning: using decl 'RaftPeerPB' is unused [misc-unused-using-decls]
Done


PS1, Line 172: RunActionStderrString
> Nit: RunActionStdoutStderrString (since it's returning both).
Done


Line 215:  protected:
> Why this change?
TEST_F macro eventually defines a class for each of the tests defined below and 
they inherit ToolTest. So if I want to be able to access these members in 
ToolTest, I have to keep them protected(as opposed to private). Also looked at 
few other examples too following this approach.


Line 216:   string tool_path_;
> Remember to order functions first, then fields.
thank you for catching, done.


PS1, Line 225:   gscoped_ptr<ExternalMiniCluster> cluster_;
             :   gscoped_ptr<ExternalMiniClusterFsInspector> inspect_;
             :   unordered_map<string, TServerDetails*> ts_map_;
             :   gscoped_ptr<MiniCluster> mini_cluster_;
> Can we use unique_ptrs here instead?
Done


Line 842:   NO_FATALS(StartExternalMiniCluster({}, {"--never_fsync"}, 
num_tservers));
> Okay but we lost the "// fsync causes ..." comment.
Done


Line 937: // Test 'kudu tablet copy' tool when the destination tablet server is 
online.
> Mike should definitely review this new test; I'm not familiar with the tabl
Sure, I have added him, will poke him in hipchat too.


Line 941: TEST_F(ToolTest, TestTabletCopy) {
> In the various RunActionFoo() calls, you're not using the resulting stdout/
Essentially, I wanted to do this:
s = RunTool(std_args, stdout, stderr)
SCOPED_TRACE(stdout)
SCOPED_TRACE(stderr)
ASSERT_OK(s);

While running the tablet copy tool test below(with RunActionStdoutNone), 
because of a bug I had in the test, the RunTool was hitting a RunTimeError, and 
the only error I was seeing in the log was 'bin/kudu exited with non-zero 
status' which led me to add RunToolStdoutStdErr() variant to print more scope 
logs in place. Hence, I have kept them not because this routine uses it, but 
because the callee uses to SCOPE_TRACE stdout/stderr which is useful at times.


Line 944:   const int kSrcTsIndex = 1, kDstTsIndex = 2, kAddOnTsIndex = 3;
> Nit: declare each of these on its own line.
Done


Line 947:       {}, {"--follower_unavailable_considered_failed_sec=3"}, 4));
> Nit: should also explain (in a comment) why we need 4 tservers.
Done


PS1, Line 951: to add this add-on tserver for all the tablets which are 
under-replicated.
> Nit: to replicate all under-replicated tablets to the add-on tserver.
Done


PS1, Line 974: consistent rows
> What does "consistent" mean in this context? And how does waiting for the c
No, this was a stale comment leftover, I think I was doing a  
WaitForServersToAgree(workload.batches_completed()) earlier which got removed. 
Removed the comment.


PS1, Line 984:  ;  i++) {
> Nit: there's extra space here, both before and after the semicolon.
Done


Line 985:     tservers.push_back(ts_map_[cluster_->tablet_server(i)->uuid()]);
> Maybe loop from i to num_tservers, then skip when i == kAddOnTsIndex? That'
Done


Line 996:   string stdout, stderr;
> Nit: declare on separate lines.
Done


PS1, Line 1018: "local_replica delete $0 --fs_wal_dir=$1 "
              :                                              "--fs_data_dirs=$2
> See my other comment about substitution and fs_wal_dir/fs_data_dirs.
Done


PS1, Line 1033: auto
> Can this be const auto?
Done


Line 1067:   NO_FATALS(StartMiniCluster());
> Why not use an external mini cluster here too, and avoid the need for code 
I added it mainly to have more control over MiniTabletServer operations, and 
yes flush probably is the main reason at this point, but I think it could be 
useful for future tests I have in mind. I prefer to not to rely on 
flags/metrics for flush. However, I am not fully against the idea and I could 
replace the MiniCluster for this test with the ideas you suggested, please 
lemme know where I could look at such metrics.


Line 1079:   // Grab the tablet's local replica to delete
> You mean grab its tablet_id, right?
Done


PS1, Line 1084:   TabletServerServiceProxy* ts_proxy = new 
TabletServerServiceProxy(
              :       ts->server()->messenger(), ts->bound_rpc_addr());
> Do we have to allocate this from the heap? Can we just declare it on the st
Done


PS1, Line 1088:   vector<ListTabletsResponsePB::StatusAndSchemaPB> tablets;
              :   tablets.assign(resp.status_and_schema().begin(), 
resp.status_and_schema().end());
              :   const string& tablet_id = 
tablets.at(0).tablet_status().tablet_id();
> Why do we need the indirection of 'tablets'? Can't we reach into resp.statu
Done


Line 1107:   const string& data_dir = JoinPathSegments(tserver_dir, "data");
> Why must we consider the data subdirectory specifically? Can't we just comp
Yeah, since delete is a 3-step operation, I wanted to make sure either 
subdirectory size is reduced(data) or they are gone(WAL/tablet-meta). Hence 
added checks against individual subdirectories.


PS1, Line 1111: "local_replica delete $0 --fs_wal_dir=$1 "
              :                                               
"--fs_data_dirs=$2",
> You can omit --fs_data_dirs. Or you can reuse $1, then pass tserver_dir jus
Done


PS1, Line 1113: stdout
> stdout is never used; can you call the variant that doesn't provide any out
Done


http://gerrit.cloudera.org:8080/#/c/4834/1/src/kudu/tools/tool_action_local_replica.cc
File src/kudu/tools/tool_action_local_replica.cc:

Line 115: using tablet::Tablet;
> warning: using decl 'Tablet' is unused [misc-unused-using-decls]
Done


Line 120: using tserver::WriteRequestPB;
> warning: using decl 'WriteRequestPB' is unused [misc-unused-using-decls]
Done


PS1, Line 276:   RETURN_NOT_OK(TSTabletManager::DeleteTabletData(
             :       meta, TabletDataState::TABLET_DATA_DELETED, boost::none));
> Looks reasonable to me, but can you make sure Mike looks at this? I'm speci
Yeah sure, my first approach was to provide last_logged_opid, but I think that 
may be more relevant when we provide a --archive option along with 'delete'. 
One thing I can add in the tests is to check if WAL.recovery files are deleted 
too with this.


http://gerrit.cloudera.org:8080/#/c/4834/1/src/kudu/tools/tool_action_tablet.cc
File src/kudu/tools/tool_action_tablet.cc:

PS1, Line 215:   const string& src_address = FindOrDie(context.required_args, 
"src_tserver_address");
             :   const string& dst_address = FindOrDie(context.required_args, 
"dst_tserver_address");
> Define the two strings as static constants earlier in the file, then use th
Done


Line 232:   rpc.set_timeout(MonoDelta::FromSeconds(30)); // Should this be 
higher ?
> If you don't set a timeout at all, won't the request never time out? Then y
Honestly, I haven't gone down that road to verify that, but in theory that 
seems like right and I will confirm with a  large tablet copy. However, I still 
want to keep the warning to keep the user aware of this op taking long time.


PS1, Line 247: StartTabletCopy
> As per KUDU-921, we'll modify the StartTabletCopy() implementation in the f
Thank you for pointing out that JIRA, I have re-worded the warning slightly to 
indicate that it could be an async op, please suggest a better wording if it's 
not clear. Ideally, this tool should block for the tablet copy, and the 
solution you suggested can be incorporated once resolve kudu-921. But it would 
be interesting to know how the background thread will report back to the source 
tserver that copy failed.


PS1, Line 300:       .AddRequiredParameter({ "src_tserver_address", "Source 
tablet server to copy the tablet replica from" })
             :       .AddRequiredParameter({ "dst_tserver_address", 
"Destination tablet server to copy the tablet replica" })
> How about just "src_address" and "dst_address", since the descriptions do a
Done


Line 315:       .AddAction(std::move(copy_tablet))
> I think the remote_replica mode would be a more appropriate home, since thi
Done


http://gerrit.cloudera.org:8080/#/c/4834/1/src/kudu/tserver/ts_tablet_manager.cc
File src/kudu/tserver/ts_tablet_manager.cc:

Line 916:   LOG(INFO) << "T " + tablet_id + " P " + meta->fs_manager()->uuid() 
+ ": "
> Can you make a static two-arg variant of LogPrefix() and chain the non-stat
Done


http://gerrit.cloudera.org:8080/#/c/4834/1/src/kudu/tserver/ts_tablet_manager.h
File src/kudu/tserver/ts_tablet_manager.h:

PS1, Line 180:   // This allows the tablet cleanup even when the tablet server 
is offline.
> Nit: I don't think this new sentence is necessary; we generally don't use a
Done


-- 
To view, visit http://gerrit.cloudera.org:8080/4834
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I113a25e9b6c14f7c3814140917b61e35030b58d0
Gerrit-PatchSet: 1
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Dinesh Bhat <[email protected]>
Gerrit-Reviewer: Adar Dembo <[email protected]>
Gerrit-Reviewer: Dinesh Bhat <[email protected]>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <[email protected]>
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <[email protected]>
Gerrit-HasComments: Yes

Reply via email to