Adar Dembo has posted comments on this change. ( )

Change subject: KUDU-2290: Tool to re-create a tablet

Patch Set 5:

Commit Message:
PS5, Line 11: is
Nit: omit this
PS5, Line 19: <table id>
Given that tablet_id is globally unique and table_name already identifies the 
table for humans, what value does the inclusion of table_id add?
PS5, Line 24: its
Nit: it's
PS5, Line 26: eventually this will stop
Maybe be more specific; this behavior ends when the cached entries' TTL expires?
File src/kudu/client/
PS5, Line 912:         // Partition should not have changed.
Could you clarify this a bit: the partition's _boundaries_ should have not 
PS5, Line 919:         // In this case, 'tablets_by_key' will be empty but a 
             :         // will exist in the cache.
I don't understand this case. AFAICT tablet replacement is "atomic", which 
means the client sees one of two states for a replaced tablet:
1. Tablet with ID foo from key x to y.
2. Tablet with ID bar from key x to y.

Since the boundaries of the tablet haven't changed, doesn't this mean 
tablet_lower_bound is a valid way to look up the cached entry regardless of 
which state the client is exposed to?

Looking at the code some more, it seems that following tablet replacement, 
'remote' is going to be null so we'll skip all of this, erase the tablet's 
partition from tablets_by_key (L937), set up a new RemoteTablet for the 
replacement, and insert the tablet's partition again (L950). At no point is the 
new tablet exposed in tablets_by_id_ without any corresponding partitions in 
tablets_by_key. I must be missing something here...
File src/kudu/master/catalog_manager.h:
PS5, Line 581:   Status ReplaceTablet(const std::string& tablet_id, 
master::ReplaceTabletResponsePB* resp);
It'd be nice to have testing of this new method outside of the CLI tests. Both 
in isolation as well as in a "stress" environment (i.e. ongoing concurrent data 
operations to the table and/or replaced tablet). It'd also be nice to test 
concurrent metadata operations: what happens if I delete a table and replace 
one of its tablets at the same time? What happens if I alter it concurrently? 
File src/kudu/master/
PS5, Line 4302:   // To be safe, we'll take the global catalog manager lock for 
the rest of this method.
While I appreciate the motivation to block out everything else, I don't think 
this is a good idea. First, the global catalog manager lock is a spinlock, 
which means anyone waiting on this lock will be spinning, and they'll be 
waiting for a long time because in ReplaceTablet the lock is held during the 
catalog write, which involves network and disk IO. Second, the catalog manager 
lock is never held while acquiring other catalog manager locks; breaking that 
invariant raises the possibility of deadlocks elsewhere. The only exceptions 
I'm aware of to this rule are during catalog manager initialization, at which 
point RPCs are largely rejected anyway.

Can we treat this like any other DDL method and acquire the right locks at the 
right time? The background task does tablet replacement in 
CatalogManager::ProcessPendingAssignments; can that be used as a template for 
how to do it safely here?
PS5, Line 4303: lock
Style nit: Use l (or l_foo) for lock guards. It especially helps in this case, 
since lock and lock_ are so similar.
File src/kudu/master/master.proto:
PS5, Line 707: recovered
Nit: "manually recovered", to make it clear that any actual recovery is out of 
scope of Kudu's normal operations?
PS5, Line 715:
I consider myself to be a protobuf luddite, but I don't understand why 
'optional' is appropriate for the below fields. There's the philosophical 
argument that all fields should be optional and 'required' was a mistake, but 
AFAICT this just pushes the burden of message verification to the receiving 
end. In this particular case, doesn't it mean that the client receiving the 
response has to verify that these optional fields exist in the response, and 
surface an error if they don't? That just seems like unnecessary (and 
untestable) boilerplate code.
File src/kudu/tools/
PS5, Line 593:
Nit: these were all grouped together, so either omit this empty line, or add an 
empty line just past L584.
File src/kudu/tools/
PS5, Line 537:   proxy.SyncRpc<ReplaceTabletRequestPB, ReplaceTabletResponsePB>(
This returns a Status that should be checked.
PS5, Line 631: .
Maybe add a \n here? See what it looks like on the command line first though.

To view, visit
To unsubscribe, visit

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ifbefbde68e3ca724f04efe0426a3906e5c33ed3c
Gerrit-Change-Number: 9393
Gerrit-PatchSet: 5
Gerrit-Owner: Will Berkeley <>
Gerrit-Reviewer: Adar Dembo <>
Gerrit-Reviewer: Andrew Wong <>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <>
Gerrit-Reviewer: Will Berkeley <>
Gerrit-Comment-Date: Tue, 13 Mar 2018 18:42:25 +0000
Gerrit-HasComments: Yes

Reply via email to