Dan Burkert has posted comments on this change.

Change subject: [java-client] implement scan token API
......................................................................


Patch Set 10:

(21 comments)

The M/R backwards incompatibilities are noted in the release notes.

http://gerrit.cloudera.org:8080/#/c/2592/10/java/kudu-client/src/main/java/org/kududb/client/ScanToken.java
File java/kudu-client/src/main/java/org/kududb/client/ScanToken.java:

Line 57: final LocatedTablet tablet;
       :   final ScanTokenPB message;
> Should be private, right?
Done


Line 83:    * Serializes this {@code ScanToken} into a byte buffer.
> Nit: maybe "byte array" to disambiguate from ByteBuffer?
Done


Line 106:   private static KuduScanner pbIntoScanner(ScanTokenPB message,
> Do we want to be able to build async scanners too?
Perhaps eventually.  I want to keep the API as small as possible at this point. 
 None of the integrations that I know of would benefit from an async scanner.


Line 109:         
!message.getFeatureFlagsList().contains(ScanTokenPB.Feature.Unknown),
> But Unknown isn't actually a feature, right? So why bother with this check 
Protobuf has the dubious 'feature' of interpreting any unknown enum variant 
value during serialization as the default variant, and the default variant is 
declared to be the first variant in source listing order, in this case 
"Unknown".


Line 112:     KuduTable table = client.openTable(message.getTableName());
> It's a shame that this needs to be done for each token, as the same KuduTab
Perhaps, but the normal usecase of sending the token to a new JVM context for 
execution means that the table wouldn't be opened otherwise, anyway.

If we want to get fancy we could introduce a weak ref cache of tables in the 
client, but I don't want to introduce that as part of this change.


Line 183:       throw new IllegalArgumentException("Scan tokens from different 
builders may not be compared");
> Nit: from different tables?
Done


Line 209:     public ScanTokenBuilder setTimeout(long timeoutMs) {
> I take it this exists because you don't want to overload the meaning of sca
It's a separate timeout.  This one is used for building the tokens (which may 
timeout if the master is extremely slow responding to get tablet location 
requests), the scanRequestTimeout is used by the scanner.


Line 223:       for (KuduPredicate predicate : this.predicates.values()) {
> Nit: why 'this' here but not when referring to lower and upper bound partit
Done


Line 275:       // if the last propagated timestamp is set send it with the scan
> Nit: start with capital letter, terminate with period. Below too.
Done


http://gerrit.cloudera.org:8080/#/c/2592/10/java/kudu-client/src/test/java/org/kududb/client/TestKuduClient.java
File java/kudu-client/src/test/java/org/kududb/client/TestKuduClient.java:

Line 36: import com.google.common.collect.ImmutableList;
       : import com.google.common.collect.Iterators;
> Nit: these should precede java.util.
Done


Line 348:     Assert.assertEquals(16, tokens.size());
> These asserts were imported statically so you needn't precede them with "As
Done


Line 355:       new Thread(new Runnable() {
> It's not obvious why scan token deserialization and scanning is done in par
Done


Line 379:     Assert.assertTrue(latch.await(tokens.size(), TimeUnit.SECONDS));
> What if you just kept references to all the threads and then join() on them
Managing the timeout is significantly harder.


http://gerrit.cloudera.org:8080/#/c/2592/10/java/kudu-mapreduce/src/main/java/org/kududb/mapreduce/KuduTableInputFormat.java
File 
java/kudu-mapreduce/src/main/java/org/kududb/mapreduce/KuduTableInputFormat.java:

Line 135
> Should we deprecate this API now that scan tokens should be used for genera
I think it's still useful for applications trying to learn about the cluster.  
E.G. I used a similar API (albeit on the c++ side) for the shell.


Line 273:   static class ScanTokenSplit extends InputSplit implements Writable, 
Comparable<ScanTokenSplit> {
> Just curious: why rename the class?
reverted.


Line 275:     private byte[] token;
> Would be useful to explain here (or in the class Javadoc) how the token and
The partition key is used for sorting, and it has to be stored ahead of time 
because the scan token serialized form is meant to be completely opaque, a 
private implementation detail.


Line 328:       // We currently just care about the row key since we're within 
the same table
> Nit: update.
Done


Line 344:       return Objects.toStringHelper(this)
> Don't want to dump a textual version of the token here?
The serialized scan token is meant to be opaque.


Line 368:         scanner = ScanToken.deserializeIntoScanner(split.token, 
client);
> Nit: access split.token through a getter. It wasn't obvious that a private 
Done


http://gerrit.cloudera.org:8080/#/c/2592/10/java/kudu-mapreduce/src/main/java/org/kududb/mapreduce/KuduTableMapReduceUtil.java
File 
java/kudu-mapreduce/src/main/java/org/kududb/mapreduce/KuduTableMapReduceUtil.java:

Line 17: import org.apache.commons.io.output.ByteArrayOutputStream;
> Shouldn't this be java.util.io.ByteArrayOutputStream?
Done


http://gerrit.cloudera.org:8080/#/c/2592/10/java/kudu-mapreduce/src/test/java/org/kududb/mapreduce/TestInputFormatJob.java
File 
java/kudu-mapreduce/src/test/java/org/kududb/mapreduce/TestInputFormatJob.java:

Line 36
> Personally I find a wildcard static import of Assert acceptable. I imagine 
Done


-- 
To view, visit http://gerrit.cloudera.org:8080/2592
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I20eff9bf51e893226fc3bc47726565ca62c054e3
Gerrit-PatchSet: 10
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Dan Burkert <[email protected]>
Gerrit-Reviewer: Adar Dembo <[email protected]>
Gerrit-Reviewer: Dan Burkert <[email protected]>
Gerrit-Reviewer: David Ribeiro Alves <[email protected]>
Gerrit-Reviewer: Jean-Daniel Cryans
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Todd Lipcon <[email protected]>
Gerrit-HasComments: Yes

Reply via email to