Dinesh Bhat has posted comments on this change.

Change subject: [tools] Manual recovery tools (part 1)
......................................................................


Patch Set 7:

(16 comments)

http://gerrit.cloudera.org:8080/#/c/4834/6/src/kudu/integration-tests/cluster_itest_util.h
File src/kudu/integration-tests/cluster_itest_util.h:

Line 268:     const std::string& tablet_id,
> std::string here and below
done, unfortunately these are not caught while compiling on OS X :(.


http://gerrit.cloudera.org:8080/#/c/4834/6/src/kudu/tools/kudu-tool-test.cc
File src/kudu/tools/kudu-tool-test.cc:

> Is it time to start breaking these tool tests out into separate tests? This
I think having all the tests related to ./bin/kudu toolset in one file is 
better. In fact we have few more tests in kudu-admin-test which we wanted to 
migrate here. Is there a concern about how long this test will run ? As of now 
it finishes roughly under 25 secs (entire test suite).


Line 950:   // Force a change_config within 3 seconds of follower 
unavailability.
> Why 3 seconds? Seems like it could make the test a bit unstable, no?
Can you explain bit more what you meant by test becoming unstable ? The intent 
here was to kick the change_config as quickly as possible, but not as 
aggressive as the heartbeat timeout itself. Hence chose 2x factor here.


PS6, Line 1013:   ASSERT_STR_CONTAINS(stderr, "Rejecting tablet copy request");
              : 
> Nit: this text is only used in one place, so perhaps combine the two lines 
Done


Line 1069:   // Verify that tombstoned_tablet_id is copied successfully.
> Won't the below stuff race with the master trying to delete them because th
You bet it does :). That's precisely I added a new routine @L1066 which waits 
till master evicts destination replica from its config for the tombstoned 
tablet we are about to copy.  That way we know that tablet is in steady 
tombstoned state and copy below is expected to succeed.

For eg, this is the race you are probably indicating which I tried to fix 
above: 
http://104.196.14.100/job/kudu-gerrit/4202/BUILD_TYPE=RELEASE/testReport/junit/(root)/ToolTest/TestRemoteReplicaCopy/


Line 1126:   const string& tserver_dir = ts->options()->fs_opts.wal_path;
> Try running the tool before shutdown to ensure that it fails?
Done


Line 1149:   const string& meta_dir = JoinPathSegments(tserver_dir,
> I think also restarting the TS and ensuring that the tablet doesn't exist a
Done, although I am trying to see why I am not able to run ts->Start() again on 
OS X, it complains about already bound RPCs. But linux seems fine, I will get 
back on this, but updated the diffs in the meanwhile.


http://gerrit.cloudera.org:8080/#/c/4834/6/src/kudu/tools/tool_action_local_replica.cc
File src/kudu/tools/tool_action_local_replica.cc:

Line 267:   // This action cleans up metadata/data/WAL for a tablet on this 
node when
> I think the information in this comment should instead be moved to the tool
Hmmm... I am not sure if we want to be more more verbose in the tool help than 
'this tool will make this tablet go away' - we don't need need to be explicit 
about wal/metadata going away I think ?


Line 736:       .Description("Delete Kudu replica from the local filesystem")
> s/replica/tablet replica/ here and elsewhere in this file's help text
This was intentional, when we were categorizing the toolset into multiple 
sub-modes, we decided to group them into relatively newer terminologies like 
this: 
'kudu tablet'  actions pertaining to tablet (or which affects tablet)
'kudu local_replica' actions on local fs when tserver not running
'kudu remote_replica' actions on the tablet replica running on another server

So we abstained from using the word 'tablet' except for actions related to 
tablet under tool_action_tablet.cc. I also changed the help description of 
'remote_replica copy' to be more consistent with this idea.


Line 743:       .Description("Operate on local Kudu replicas via the local 
filesystem")
> tablet replicas
same as above.


http://gerrit.cloudera.org:8080/#/c/4834/6/src/kudu/tools/tool_action_remote_replica.cc
File src/kudu/tools/tool_action_remote_replica.cc:

Line 299:   RETURN_NOT_OK(GetServerStatus(dst_address, 
TabletServer::kDefaultPort,
> I believe GetServerStatus() will use the port embedded in the first arg and
That's correct:
https://github.com/apache/kudu/blob/master/src/kudu/util/net/net_util.cc#L101


Line 315:   // Provide a force option if the destination tablet server has the
> What does "applicable" mean here?
deleted the sentence, I was trying to indicate tablet-copy may fail 
irrespective of the state of the tablet as long as replica exists on the node, 
but that's probably redundant.


Line 322:   LOG(WARNING) << "NOTE: this copy may happen asynchronously "
> Why all these conversions? How about:
Done


Line 331:                             
tserver::TabletServerErrorPB_Code(resp.error().code())));
> I think something like this would be better:
Done, wrapped with RETURN_NOT_OK_PREPEND.


http://gerrit.cloudera.org:8080/#/c/4834/6/src/kudu/tserver/ts_tablet_manager.cc
File src/kudu/tserver/ts_tablet_manager.cc:

Line 563:   DCHECK(fs_manager != nullptr);
> nit: DCHECK_NOTNULL macro returns its argument so consider using DCHECK(fs_
Done


Line 564:   return Substitute("T $0 P $1: ", tablet_id, fs_manager->uuid());
> nit: consider using strings::Substitute here to cut down on memory allocati
Done


-- 
To view, visit http://gerrit.cloudera.org:8080/4834
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I113a25e9b6c14f7c3814140917b61e35030b58d0
Gerrit-PatchSet: 7
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Dinesh Bhat <[email protected]>
Gerrit-Reviewer: Adar Dembo <[email protected]>
Gerrit-Reviewer: David Ribeiro Alves <[email protected]>
Gerrit-Reviewer: Dinesh Bhat <[email protected]>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <[email protected]>
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <[email protected]>
Gerrit-HasComments: Yes

Reply via email to