Adar Dembo has posted comments on this change. Change subject: [tools] Manual recovery tools (part 1) ......................................................................
Patch Set 1: (32 comments) http://gerrit.cloudera.org:8080/#/c/4834/1/src/kudu/integration-tests/cluster_itest_util.cc File src/kudu/integration-tests/cluster_itest_util.cc: PS1, Line 224: return Status::TimedOut(Substitute("Index $0 not available on all replicas after $1. " : "Replicas: [ $2 ]", : log_index, passed.ToString(), replicas_str)); Good catch. Could you reindent the continuation lines though? http://gerrit.cloudera.org:8080/#/c/4834/1/src/kudu/tools/kudu-tool-test.cc File src/kudu/tools/kudu-tool-test.cc: PS1, Line 172: RunActionStderrString Nit: RunActionStdoutStderrString (since it's returning both). Line 215: protected: Why this change? Line 216: string tool_path_; Remember to order functions first, then fields. PS1, Line 225: gscoped_ptr<ExternalMiniCluster> cluster_; : gscoped_ptr<ExternalMiniClusterFsInspector> inspect_; : unordered_map<string, TServerDetails*> ts_map_; : gscoped_ptr<MiniCluster> mini_cluster_; Can we use unique_ptrs here instead? Line 842: NO_FATALS(StartExternalMiniCluster({}, {"--never_fsync"}, num_tservers)); Okay but we lost the "// fsync causes ..." comment. Line 937: // Test 'kudu tablet copy' tool when the destination tablet server is online. Mike should definitely review this new test; I'm not familiar with the tablet deletion logic the way he is, and so I don't know how to assess the various WaitForFoo() calls. Line 941: TEST_F(ToolTest, TestTabletCopy) { In the various RunActionFoo() calls, you're not using the resulting stdout/stderr; can you replace them with no-arg calls? Line 944: const int kSrcTsIndex = 1, kDstTsIndex = 2, kAddOnTsIndex = 3; Nit: declare each of these on its own line. Line 947: {}, {"--follower_unavailable_considered_failed_sec=3"}, 4)); Nit: should also explain (in a comment) why we need 4 tservers. PS1, Line 951: to add this add-on tserver for all the tablets which are under-replicated. Nit: to replicate all under-replicated tablets to the add-on tserver. PS1, Line 974: consistent rows What does "consistent" mean in this context? And how does waiting for the client to write at least 1000 rows guarantee it? PS1, Line 984: ; i++) { Nit: there's extra space here, both before and after the semicolon. Line 985: tservers.push_back(ts_map_[cluster_->tablet_server(i)->uuid()]); Maybe loop from i to num_tservers, then skip when i == kAddOnTsIndex? That'll be robust to changes in kAddOnTsIndex; right now this loop assumes that kAddOnTsIndex is the last index. Line 996: string stdout, stderr; Nit: declare on separate lines. PS1, Line 1018: "local_replica delete $0 --fs_wal_dir=$1 " : "--fs_data_dirs=$2 See my other comment about substitution and fs_wal_dir/fs_data_dirs. PS1, Line 1033: auto Can this be const auto? Line 1067: NO_FATALS(StartMiniCluster()); Why not use an external mini cluster here too, and avoid the need for code that can start either kind? Can we get by with that? If the need is to force a flush on the particular tablet, you can set flush_threshold_mb very low, insert rows in a loop, and periodically look at a maintenance manager metric to see if a flush happened. Line 1079: // Grab the tablet's local replica to delete You mean grab its tablet_id, right? PS1, Line 1084: TabletServerServiceProxy* ts_proxy = new TabletServerServiceProxy( : ts->server()->messenger(), ts->bound_rpc_addr()); Do we have to allocate this from the heap? Can we just declare it on the stack? PS1, Line 1088: vector<ListTabletsResponsePB::StatusAndSchemaPB> tablets; : tablets.assign(resp.status_and_schema().begin(), resp.status_and_schema().end()); : const string& tablet_id = tablets.at(0).tablet_status().tablet_id(); Why do we need the indirection of 'tablets'? Can't we reach into resp.status_and_schema() directly to get the first tablet? It's the only tablet right? We should ASSERT that. Line 1107: const string& data_dir = JoinPathSegments(tserver_dir, "data"); Why must we consider the data subdirectory specifically? Can't we just compare the size of tserver_dir before and after? PS1, Line 1111: "local_replica delete $0 --fs_wal_dir=$1 " : "--fs_data_dirs=$2", You can omit --fs_data_dirs. Or you can reuse $1, then pass tserver_dir just once into the substitution. PS1, Line 1113: stdout stdout is never used; can you call the variant that doesn't provide any output? http://gerrit.cloudera.org:8080/#/c/4834/1/src/kudu/tools/tool_action_local_replica.cc File src/kudu/tools/tool_action_local_replica.cc: PS1, Line 276: RETURN_NOT_OK(TSTabletManager::DeleteTabletData( : meta, TabletDataState::TABLET_DATA_DELETED, boost::none)); Looks reasonable to me, but can you make sure Mike looks at this? I'm specifically wondering whether 1) we should tombstone the replica, and 2) whether we need to provide a last_logged_opid. http://gerrit.cloudera.org:8080/#/c/4834/1/src/kudu/tools/tool_action_tablet.cc File src/kudu/tools/tool_action_tablet.cc: PS1, Line 215: const string& src_address = FindOrDie(context.required_args, "src_tserver_address"); : const string& dst_address = FindOrDie(context.required_args, "dst_tserver_address"); Define the two strings as static constants earlier in the file, then use those here and in BuildTabletMode(). Line 232: rpc.set_timeout(MonoDelta::FromSeconds(30)); // Should this be higher ? If you don't set a timeout at all, won't the request never time out? Then you won't need to LOG that warning below either. PS1, Line 247: StartTabletCopy As per KUDU-921, we'll modify the StartTabletCopy() implementation in the future so that it doesn't block on the entire copy procedure finishing. We should adjust the name/description of this action accordingly, to imply that it merely starts the copy operation, but doesn't wait for it to finish. If you think it's important for it to wait on the copy to finish, we'll need to add some way to ask a tserver about its outstanding tablet copies. PS1, Line 300: .AddRequiredParameter({ "src_tserver_address", "Source tablet server to copy the tablet replica from" }) : .AddRequiredParameter({ "dst_tserver_address", "Destination tablet server to copy the tablet replica" }) How about just "src_address" and "dst_address", since the descriptions do a good job of explaining that these are tserver addresses? Also, try to use the same description as is used in tool_action_tserver.cc's kTServerAddressDesc. Line 315: .AddAction(std::move(copy_tablet)) I think the remote_replica mode would be a more appropriate home, since this operation has to do with a single replica rather than an entire tablet. http://gerrit.cloudera.org:8080/#/c/4834/1/src/kudu/tserver/ts_tablet_manager.cc File src/kudu/tserver/ts_tablet_manager.cc: Line 916: LOG(INFO) << "T " + tablet_id + " P " + meta->fs_manager()->uuid() + ": " Can you make a static two-arg variant of LogPrefix() and chain the non-static variant into it? That way the actual logging format (i.e. "T " tablet_id ...") is still described in just one place. http://gerrit.cloudera.org:8080/#/c/4834/1/src/kudu/tserver/ts_tablet_manager.h File src/kudu/tserver/ts_tablet_manager.h: PS1, Line 180: // This allows the tablet cleanup even when the tablet server is offline. Nit: I don't think this new sentence is necessary; we generally don't use a function comment to justify the existence of the function. -- To view, visit http://gerrit.cloudera.org:8080/4834 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I113a25e9b6c14f7c3814140917b61e35030b58d0 Gerrit-PatchSet: 1 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Dinesh Bhat <[email protected]> Gerrit-Reviewer: Adar Dembo <[email protected]> Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Mike Percy <[email protected]> Gerrit-Reviewer: Tidy Bot Gerrit-Reviewer: Todd Lipcon <[email protected]> Gerrit-HasComments: Yes
