Dinesh Bhat has posted comments on this change. Change subject: [tools] Manual recovery tools (part 1) ......................................................................
Patch Set 1: (38 comments) http://gerrit.cloudera.org:8080/#/c/4834/1/src/kudu/integration-tests/cluster_itest_util.cc File src/kudu/integration-tests/cluster_itest_util.cc: PS1, Line 224: return Status::TimedOut(Substitute("Index $0 not available on all replicas after $1. " : "Replicas: [ $2 ]", : log_index, passed.ToString(), replicas_str)); > Good catch. Could you reindent the continuation lines though? Done http://gerrit.cloudera.org:8080/#/c/4834/1/src/kudu/tools/kudu-tool-test.cc File src/kudu/tools/kudu-tool-test.cc: Line 56: #include "kudu/tablet/tablet.h" > warning: #includes are not sorted properly [llvm-include-order] Done Line 84: using consensus::ConsensusConfigType; > warning: using decl 'ConsensusConfigType' is unused [misc-unused-using-decl Done Line 85: using consensus::ConsensusStatePB; > warning: using decl 'ConsensusStatePB' is unused [misc-unused-using-decls] Done Line 87: using consensus::RaftPeerPB; > warning: using decl 'RaftPeerPB' is unused [misc-unused-using-decls] Done PS1, Line 172: RunActionStderrString > Nit: RunActionStdoutStderrString (since it's returning both). Done Line 215: protected: > Why this change? TEST_F macro eventually defines a class for each of the tests defined below and they inherit ToolTest. So if I want to be able to access these members in ToolTest, I have to keep them protected(as opposed to private). Also looked at few other examples too following this approach. Line 216: string tool_path_; > Remember to order functions first, then fields. thank you for catching, done. PS1, Line 225: gscoped_ptr<ExternalMiniCluster> cluster_; : gscoped_ptr<ExternalMiniClusterFsInspector> inspect_; : unordered_map<string, TServerDetails*> ts_map_; : gscoped_ptr<MiniCluster> mini_cluster_; > Can we use unique_ptrs here instead? Done Line 842: NO_FATALS(StartExternalMiniCluster({}, {"--never_fsync"}, num_tservers)); > Okay but we lost the "// fsync causes ..." comment. Done Line 937: // Test 'kudu tablet copy' tool when the destination tablet server is online. > Mike should definitely review this new test; I'm not familiar with the tabl Sure, I have added him, will poke him in hipchat too. Line 941: TEST_F(ToolTest, TestTabletCopy) { > In the various RunActionFoo() calls, you're not using the resulting stdout/ Essentially, I wanted to do this: s = RunTool(std_args, stdout, stderr) SCOPED_TRACE(stdout) SCOPED_TRACE(stderr) ASSERT_OK(s); While running the tablet copy tool test below(with RunActionStdoutNone), because of a bug I had in the test, the RunTool was hitting a RunTimeError, and the only error I was seeing in the log was 'bin/kudu exited with non-zero status' which led me to add RunToolStdoutStdErr() variant to print more scope logs in place. Hence, I have kept them not because this routine uses it, but because the callee uses to SCOPE_TRACE stdout/stderr which is useful at times. Line 944: const int kSrcTsIndex = 1, kDstTsIndex = 2, kAddOnTsIndex = 3; > Nit: declare each of these on its own line. Done Line 947: {}, {"--follower_unavailable_considered_failed_sec=3"}, 4)); > Nit: should also explain (in a comment) why we need 4 tservers. Done PS1, Line 951: to add this add-on tserver for all the tablets which are under-replicated. > Nit: to replicate all under-replicated tablets to the add-on tserver. Done PS1, Line 974: consistent rows > What does "consistent" mean in this context? And how does waiting for the c No, this was a stale comment leftover, I think I was doing a WaitForServersToAgree(workload.batches_completed()) earlier which got removed. Removed the comment. PS1, Line 984: ; i++) { > Nit: there's extra space here, both before and after the semicolon. Done Line 985: tservers.push_back(ts_map_[cluster_->tablet_server(i)->uuid()]); > Maybe loop from i to num_tservers, then skip when i == kAddOnTsIndex? That' Done Line 996: string stdout, stderr; > Nit: declare on separate lines. Done PS1, Line 1018: "local_replica delete $0 --fs_wal_dir=$1 " : "--fs_data_dirs=$2 > See my other comment about substitution and fs_wal_dir/fs_data_dirs. Done PS1, Line 1033: auto > Can this be const auto? Done Line 1067: NO_FATALS(StartMiniCluster()); > Why not use an external mini cluster here too, and avoid the need for code I added it mainly to have more control over MiniTabletServer operations, and yes flush probably is the main reason at this point, but I think it could be useful for future tests I have in mind. I prefer to not to rely on flags/metrics for flush. However, I am not fully against the idea and I could replace the MiniCluster for this test with the ideas you suggested, please lemme know where I could look at such metrics. Line 1079: // Grab the tablet's local replica to delete > You mean grab its tablet_id, right? Done PS1, Line 1084: TabletServerServiceProxy* ts_proxy = new TabletServerServiceProxy( : ts->server()->messenger(), ts->bound_rpc_addr()); > Do we have to allocate this from the heap? Can we just declare it on the st Done PS1, Line 1088: vector<ListTabletsResponsePB::StatusAndSchemaPB> tablets; : tablets.assign(resp.status_and_schema().begin(), resp.status_and_schema().end()); : const string& tablet_id = tablets.at(0).tablet_status().tablet_id(); > Why do we need the indirection of 'tablets'? Can't we reach into resp.statu Done Line 1107: const string& data_dir = JoinPathSegments(tserver_dir, "data"); > Why must we consider the data subdirectory specifically? Can't we just comp Yeah, since delete is a 3-step operation, I wanted to make sure either subdirectory size is reduced(data) or they are gone(WAL/tablet-meta). Hence added checks against individual subdirectories. PS1, Line 1111: "local_replica delete $0 --fs_wal_dir=$1 " : "--fs_data_dirs=$2", > You can omit --fs_data_dirs. Or you can reuse $1, then pass tserver_dir jus Done PS1, Line 1113: stdout > stdout is never used; can you call the variant that doesn't provide any out Done http://gerrit.cloudera.org:8080/#/c/4834/1/src/kudu/tools/tool_action_local_replica.cc File src/kudu/tools/tool_action_local_replica.cc: Line 115: using tablet::Tablet; > warning: using decl 'Tablet' is unused [misc-unused-using-decls] Done Line 120: using tserver::WriteRequestPB; > warning: using decl 'WriteRequestPB' is unused [misc-unused-using-decls] Done PS1, Line 276: RETURN_NOT_OK(TSTabletManager::DeleteTabletData( : meta, TabletDataState::TABLET_DATA_DELETED, boost::none)); > Looks reasonable to me, but can you make sure Mike looks at this? I'm speci Yeah sure, my first approach was to provide last_logged_opid, but I think that may be more relevant when we provide a --archive option along with 'delete'. One thing I can add in the tests is to check if WAL.recovery files are deleted too with this. http://gerrit.cloudera.org:8080/#/c/4834/1/src/kudu/tools/tool_action_tablet.cc File src/kudu/tools/tool_action_tablet.cc: PS1, Line 215: const string& src_address = FindOrDie(context.required_args, "src_tserver_address"); : const string& dst_address = FindOrDie(context.required_args, "dst_tserver_address"); > Define the two strings as static constants earlier in the file, then use th Done Line 232: rpc.set_timeout(MonoDelta::FromSeconds(30)); // Should this be higher ? > If you don't set a timeout at all, won't the request never time out? Then y Honestly, I haven't gone down that road to verify that, but in theory that seems like right and I will confirm with a large tablet copy. However, I still want to keep the warning to keep the user aware of this op taking long time. PS1, Line 247: StartTabletCopy > As per KUDU-921, we'll modify the StartTabletCopy() implementation in the f Thank you for pointing out that JIRA, I have re-worded the warning slightly to indicate that it could be an async op, please suggest a better wording if it's not clear. Ideally, this tool should block for the tablet copy, and the solution you suggested can be incorporated once resolve kudu-921. But it would be interesting to know how the background thread will report back to the source tserver that copy failed. PS1, Line 300: .AddRequiredParameter({ "src_tserver_address", "Source tablet server to copy the tablet replica from" }) : .AddRequiredParameter({ "dst_tserver_address", "Destination tablet server to copy the tablet replica" }) > How about just "src_address" and "dst_address", since the descriptions do a Done Line 315: .AddAction(std::move(copy_tablet)) > I think the remote_replica mode would be a more appropriate home, since thi Done http://gerrit.cloudera.org:8080/#/c/4834/1/src/kudu/tserver/ts_tablet_manager.cc File src/kudu/tserver/ts_tablet_manager.cc: Line 916: LOG(INFO) << "T " + tablet_id + " P " + meta->fs_manager()->uuid() + ": " > Can you make a static two-arg variant of LogPrefix() and chain the non-stat Done http://gerrit.cloudera.org:8080/#/c/4834/1/src/kudu/tserver/ts_tablet_manager.h File src/kudu/tserver/ts_tablet_manager.h: PS1, Line 180: // This allows the tablet cleanup even when the tablet server is offline. > Nit: I don't think this new sentence is necessary; we generally don't use a Done -- To view, visit http://gerrit.cloudera.org:8080/4834 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I113a25e9b6c14f7c3814140917b61e35030b58d0 Gerrit-PatchSet: 1 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Dinesh Bhat <[email protected]> Gerrit-Reviewer: Adar Dembo <[email protected]> Gerrit-Reviewer: Dinesh Bhat <[email protected]> Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Mike Percy <[email protected]> Gerrit-Reviewer: Tidy Bot Gerrit-Reviewer: Todd Lipcon <[email protected]> Gerrit-HasComments: Yes
