[kudu-CR] KUDU-2290: Tool to re-create a tablet

Adar Dembo (Code Review) Tue, 13 Mar 2018 11:42:59 -0700

Adar Dembo has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/9393 )


Change subject: KUDU-2290: Tool to re-create a tablet
......................................................................


Patch Set 5:

(14 comments)

http://gerrit.cloudera.org:8080/#/c/9393/5//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/9393/5//COMMIT_MSG@11
PS5, Line 11: is
Nit: omit this


http://gerrit.cloudera.org:8080/#/c/9393/5//COMMIT_MSG@19
PS5, Line 19: <table id>
Given that tablet_id is globally unique and table_name already identifies the 
table for humans, what value does the inclusion of table_id add?


http://gerrit.cloudera.org:8080/#/c/9393/5//COMMIT_MSG@24
PS5, Line 24: its
Nit: it's


http://gerrit.cloudera.org:8080/#/c/9393/5//COMMIT_MSG@26
PS5, Line 26: eventually this will stop
Maybe be more specific; this behavior ends when the cached entries' TTL expires?


http://gerrit.cloudera.org:8080/#/c/9393/5/src/kudu/client/meta_cache.cc
File src/kudu/client/meta_cache.cc:

http://gerrit.cloudera.org:8080/#/c/9393/5/src/kudu/client/meta_cache.cc@912
PS5, Line 912:         // Partition should not have changed.
Could you clarify this a bit: the partition's _boundaries_ should have not 
changed,


http://gerrit.cloudera.org:8080/#/c/9393/5/src/kudu/client/meta_cache.cc@919
PS5, Line 919:         // In this case, 'tablets_by_key' will be empty but a 
RemoteTablet
             :         // will exist in the cache.
I don't understand this case. AFAICT tablet replacement is "atomic", which 
means the client sees one of two states for a replaced tablet:
1. Tablet with ID foo from key x to y.
2. Tablet with ID bar from key x to y.

Since the boundaries of the tablet haven't changed, doesn't this mean 
tablet_lower_bound is a valid way to look up the cached entry regardless of 
which state the client is exposed to?

Looking at the code some more, it seems that following tablet replacement, 
'remote' is going to be null so we'll skip all of this, erase the tablet's 
partition from tablets_by_key (L937), set up a new RemoteTablet for the 
replacement, and insert the tablet's partition again (L950). At no point is the 
new tablet exposed in tablets_by_id_ without any corresponding partitions in 
tablets_by_key. I must be missing something here...


http://gerrit.cloudera.org:8080/#/c/9393/5/src/kudu/master/catalog_manager.h
File src/kudu/master/catalog_manager.h:

http://gerrit.cloudera.org:8080/#/c/9393/5/src/kudu/master/catalog_manager.h@581
PS5, Line 581:   Status ReplaceTablet(const std::string& tablet_id, 
master::ReplaceTabletResponsePB* resp);
It'd be nice to have testing of this new method outside of the CLI tests. Both 
in isolation as well as in a "stress" environment (i.e. ongoing concurrent data 
operations to the table and/or replaced tablet). It'd also be nice to test 
concurrent metadata operations: what happens if I delete a table and replace 
one of its tablets at the same time? What happens if I alter it concurrently? 
Etc.


http://gerrit.cloudera.org:8080/#/c/9393/5/src/kudu/master/catalog_manager.cc
File src/kudu/master/catalog_manager.cc:

http://gerrit.cloudera.org:8080/#/c/9393/5/src/kudu/master/catalog_manager.cc@4302
PS5, Line 4302:   // To be safe, we'll take the global catalog manager lock for 
the rest of this method.
While I appreciate the motivation to block out everything else, I don't think 
this is a good idea. First, the global catalog manager lock is a spinlock, 
which means anyone waiting on this lock will be spinning, and they'll be 
waiting for a long time because in ReplaceTablet the lock is held during the 
catalog write, which involves network and disk IO. Second, the catalog manager 
lock is never held while acquiring other catalog manager locks; breaking that 
invariant raises the possibility of deadlocks elsewhere. The only exceptions 
I'm aware of to this rule are during catalog manager initialization, at which 
point RPCs are largely rejected anyway.

Can we treat this like any other DDL method and acquire the right locks at the 
right time? The background task does tablet replacement in 
CatalogManager::ProcessPendingAssignments; can that be used as a template for 
how to do it safely here?


http://gerrit.cloudera.org:8080/#/c/9393/5/src/kudu/master/catalog_manager.cc@4303
PS5, Line 4303: lock
Style nit: Use l (or l_foo) for lock guards. It especially helps in this case, 
since lock and lock_ are so similar.


http://gerrit.cloudera.org:8080/#/c/9393/5/src/kudu/master/master.proto
File src/kudu/master/master.proto:

http://gerrit.cloudera.org:8080/#/c/9393/5/src/kudu/master/master.proto@707
PS5, Line 707: recovered
Nit: "manually recovered", to make it clear that any actual recovery is out of 
scope of Kudu's normal operations?


http://gerrit.cloudera.org:8080/#/c/9393/5/src/kudu/master/master.proto@715
PS5, Line 715:
I consider myself to be a protobuf luddite, but I don't understand why 
'optional' is appropriate for the below fields. There's the philosophical 
argument that all fields should be optional and 'required' was a mistake, but 
AFAICT this just pushes the burden of message verification to the receiving 
end. In this particular case, doesn't it mean that the client receiving the 
response has to verify that these optional fields exist in the response, and 
surface an error if they don't? That just seems like unnecessary (and 
untestable) boilerplate code.


http://gerrit.cloudera.org:8080/#/c/9393/5/src/kudu/tools/tool_action_common.cc
File src/kudu/tools/tool_action_common.cc:

http://gerrit.cloudera.org:8080/#/c/9393/5/src/kudu/tools/tool_action_common.cc@593
PS5, Line 593:
Nit: these were all grouped together, so either omit this empty line, or add an 
empty line just past L584.


http://gerrit.cloudera.org:8080/#/c/9393/5/src/kudu/tools/tool_action_tablet.cc
File src/kudu/tools/tool_action_tablet.cc:

http://gerrit.cloudera.org:8080/#/c/9393/5/src/kudu/tools/tool_action_tablet.cc@537
PS5, Line 537:   proxy.SyncRpc<ReplaceTabletRequestPB, ReplaceTabletResponsePB>(
This returns a Status that should be checked.


http://gerrit.cloudera.org:8080/#/c/9393/5/src/kudu/tools/tool_action_tablet.cc@631
PS5, Line 631: .
Maybe add a \n here? See what it looks like on the command line first though.



--
To view, visit http://gerrit.cloudera.org:8080/9393
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ifbefbde68e3ca724f04efe0426a3906e5c33ed3c
Gerrit-Change-Number: 9393
Gerrit-PatchSet: 5
Gerrit-Owner: Will Berkeley <[email protected]>
Gerrit-Reviewer: Adar Dembo <[email protected]>
Gerrit-Reviewer: Andrew Wong <[email protected]>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <[email protected]>
Gerrit-Reviewer: Will Berkeley <[email protected]>
Gerrit-Comment-Date: Tue, 13 Mar 2018 18:42:25 +0000
Gerrit-HasComments: Yes

[kudu-CR] KUDU-2290: Tool to re-create a tablet

Reply via email to