Dinesh Bhat has posted comments on this change. Change subject: [tools] Manual recovery tools (part 1) ......................................................................
Patch Set 7: (16 comments) http://gerrit.cloudera.org:8080/#/c/4834/6/src/kudu/integration-tests/cluster_itest_util.h File src/kudu/integration-tests/cluster_itest_util.h: Line 268: const std::string& tablet_id, > std::string here and below done, unfortunately these are not caught while compiling on OS X :(. http://gerrit.cloudera.org:8080/#/c/4834/6/src/kudu/tools/kudu-tool-test.cc File src/kudu/tools/kudu-tool-test.cc: > Is it time to start breaking these tool tests out into separate tests? This I think having all the tests related to ./bin/kudu toolset in one file is better. In fact we have few more tests in kudu-admin-test which we wanted to migrate here. Is there a concern about how long this test will run ? As of now it finishes roughly under 25 secs (entire test suite). Line 950: // Force a change_config within 3 seconds of follower unavailability. > Why 3 seconds? Seems like it could make the test a bit unstable, no? Can you explain bit more what you meant by test becoming unstable ? The intent here was to kick the change_config as quickly as possible, but not as aggressive as the heartbeat timeout itself. Hence chose 2x factor here. PS6, Line 1013: ASSERT_STR_CONTAINS(stderr, "Rejecting tablet copy request"); : > Nit: this text is only used in one place, so perhaps combine the two lines Done Line 1069: // Verify that tombstoned_tablet_id is copied successfully. > Won't the below stuff race with the master trying to delete them because th You bet it does :). That's precisely I added a new routine @L1066 which waits till master evicts destination replica from its config for the tombstoned tablet we are about to copy. That way we know that tablet is in steady tombstoned state and copy below is expected to succeed. For eg, this is the race you are probably indicating which I tried to fix above: http://104.196.14.100/job/kudu-gerrit/4202/BUILD_TYPE=RELEASE/testReport/junit/(root)/ToolTest/TestRemoteReplicaCopy/ Line 1126: const string& tserver_dir = ts->options()->fs_opts.wal_path; > Try running the tool before shutdown to ensure that it fails? Done Line 1149: const string& meta_dir = JoinPathSegments(tserver_dir, > I think also restarting the TS and ensuring that the tablet doesn't exist a Done, although I am trying to see why I am not able to run ts->Start() again on OS X, it complains about already bound RPCs. But linux seems fine, I will get back on this, but updated the diffs in the meanwhile. http://gerrit.cloudera.org:8080/#/c/4834/6/src/kudu/tools/tool_action_local_replica.cc File src/kudu/tools/tool_action_local_replica.cc: Line 267: // This action cleans up metadata/data/WAL for a tablet on this node when > I think the information in this comment should instead be moved to the tool Hmmm... I am not sure if we want to be more more verbose in the tool help than 'this tool will make this tablet go away' - we don't need need to be explicit about wal/metadata going away I think ? Line 736: .Description("Delete Kudu replica from the local filesystem") > s/replica/tablet replica/ here and elsewhere in this file's help text This was intentional, when we were categorizing the toolset into multiple sub-modes, we decided to group them into relatively newer terminologies like this: 'kudu tablet' actions pertaining to tablet (or which affects tablet) 'kudu local_replica' actions on local fs when tserver not running 'kudu remote_replica' actions on the tablet replica running on another server So we abstained from using the word 'tablet' except for actions related to tablet under tool_action_tablet.cc. I also changed the help description of 'remote_replica copy' to be more consistent with this idea. Line 743: .Description("Operate on local Kudu replicas via the local filesystem") > tablet replicas same as above. http://gerrit.cloudera.org:8080/#/c/4834/6/src/kudu/tools/tool_action_remote_replica.cc File src/kudu/tools/tool_action_remote_replica.cc: Line 299: RETURN_NOT_OK(GetServerStatus(dst_address, TabletServer::kDefaultPort, > I believe GetServerStatus() will use the port embedded in the first arg and That's correct: https://github.com/apache/kudu/blob/master/src/kudu/util/net/net_util.cc#L101 Line 315: // Provide a force option if the destination tablet server has the > What does "applicable" mean here? deleted the sentence, I was trying to indicate tablet-copy may fail irrespective of the state of the tablet as long as replica exists on the node, but that's probably redundant. Line 322: LOG(WARNING) << "NOTE: this copy may happen asynchronously " > Why all these conversions? How about: Done Line 331: tserver::TabletServerErrorPB_Code(resp.error().code()))); > I think something like this would be better: Done, wrapped with RETURN_NOT_OK_PREPEND. http://gerrit.cloudera.org:8080/#/c/4834/6/src/kudu/tserver/ts_tablet_manager.cc File src/kudu/tserver/ts_tablet_manager.cc: Line 563: DCHECK(fs_manager != nullptr); > nit: DCHECK_NOTNULL macro returns its argument so consider using DCHECK(fs_ Done Line 564: return Substitute("T $0 P $1: ", tablet_id, fs_manager->uuid()); > nit: consider using strings::Substitute here to cut down on memory allocati Done -- To view, visit http://gerrit.cloudera.org:8080/4834 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I113a25e9b6c14f7c3814140917b61e35030b58d0 Gerrit-PatchSet: 7 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Dinesh Bhat <[email protected]> Gerrit-Reviewer: Adar Dembo <[email protected]> Gerrit-Reviewer: David Ribeiro Alves <[email protected]> Gerrit-Reviewer: Dinesh Bhat <[email protected]> Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Mike Percy <[email protected]> Gerrit-Reviewer: Tidy Bot Gerrit-Reviewer: Todd Lipcon <[email protected]> Gerrit-HasComments: Yes
