Alex Behm has posted comments on this change.

Change subject: IMPALA-5309: Adds TABLESAMPLE clause for HDFS table refs.
......................................................................


Patch Set 3:

(7 comments)

http://gerrit.cloudera.org:8080/#/c/6868/3/fe/src/main/java/org/apache/impala/analysis/TableSampleClause.java
File fe/src/main/java/org/apache/impala/analysis/TableSampleClause.java:

Line 68:     if (randomSeed_ != null) {
> single line?
Done


http://gerrit.cloudera.org:8080/#/c/6868/3/fe/src/main/java/org/apache/impala/catalog/HdfsPartition.java
File fe/src/main/java/org/apache/impala/catalog/HdfsPartition.java:

Line 812:     result.numRows_ = numRows_;
> this is a bit weird outside of the context of sampling (why should #rows st
Might look weird at first glance, but I don't think it is because the #rows 
statistic does not necessarily correspond to the current file descriptors 
especially with sampling.

I could not find a reasonable way to integrate the sampling into 
computeScanRangeLocations(). Details discussed in another comment thread.


http://gerrit.cloudera.org:8080/#/c/6868/3/fe/src/main/java/org/apache/impala/catalog/HdfsTable.java
File fe/src/main/java/org/apache/impala/catalog/HdfsTable.java:

Line 1961:     List<Pair<Long, FileDescriptor>> allFiles =
> I think it overall makes more sense to integrate the sampling into HdfsScan
I tried integrating the sampling into computeScanRangeLocations() but ran into 
the same question of how to avoid selecting the same file multiple times 
(without retry loop). I don't really see how we efficiently solve this problem 
without a solution that looks similar to what is already here.

I changed this code function to be more efficient, but fundamentally the 
approach is the same.

This function now returns a map like you suggested, but I'm not really sure 
that is cleaner than the partition cloning approach.


http://gerrit.cloudera.org:8080/#/c/6868/3/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
File fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java:

Line 132:   // Total number of files from the effective partitions (partitions_ 
or sample_).
> update comment
Done


Line 135:   // Total number of bytes from the effective partitions (partitions_ 
or sample_).
> same
Done


Line 678:           cardinality_ = addCardinalities(cardinality_, 
p.getNumRows());
> shouldn't the cardinality reflect the reduction from sampling? if you're sa
The cardinality is adjusted in L694. In a similar place in the new patch.


http://gerrit.cloudera.org:8080/#/c/6868/3/tests/query_test/test_tablesample.py
File tests/query_test/test_tablesample.py:

Line 47:     # 2. The results of queries without a repeatable clause could be 
change due to
> remove "be"
Done


-- 
To view, visit http://gerrit.cloudera.org:8080/6868
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Ief112cfb1e4983c5d94c08696dc83da9ccf43f70
Gerrit-PatchSet: 3
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Alex Behm <[email protected]>
Gerrit-Reviewer: Alex Behm <[email protected]>
Gerrit-Reviewer: Marcel Kornacker <[email protected]>
Gerrit-HasComments: Yes

Reply via email to