Marcel Kornacker has posted comments on this change.

Change subject: IMPALA-5309: Adds TABLESAMPLE clause for HDFS table refs.
......................................................................


Patch Set 3:

(7 comments)

http://gerrit.cloudera.org:8080/#/c/6868/3/fe/src/main/java/org/apache/impala/analysis/TableSampleClause.java
File fe/src/main/java/org/apache/impala/analysis/TableSampleClause.java:

Line 68:     if (randomSeed_ != null) {
single line?


http://gerrit.cloudera.org:8080/#/c/6868/3/fe/src/main/java/org/apache/impala/catalog/HdfsPartition.java
File fe/src/main/java/org/apache/impala/catalog/HdfsPartition.java:

Line 812:     result.numRows_ = numRows_;
this is a bit weird outside of the context of sampling (why should #rows stay 
the same for a different set of files?).

instead of doing the sampling in the metadata, would it make more sense to do 
it during scan range computation?


http://gerrit.cloudera.org:8080/#/c/6868/3/fe/src/main/java/org/apache/impala/catalog/HdfsTable.java
File fe/src/main/java/org/apache/impala/catalog/HdfsTable.java:

Line 1961:     List<Pair<Long, FileDescriptor>> allFiles =
this is an expensive list, and i don't think you need it. you can achieve the 
same effect with a list that mirrors orderedparts and contains the cumulative # 
of files up to the preceding partition (e.g., {0, 517, 1120, ...}). you then 
generate a random sequence of numbers between 0 and #totalfiles-1. this 
requires space proportional to the sample size, but not the total number of 
files.

you could even return this as a map<partitionid, set<int>> (the value being the 
indices into the per-partition list of file descriptors), and have hdfsscannode 
use that to access the sampled files in the original partitions.


http://gerrit.cloudera.org:8080/#/c/6868/3/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
File fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java:

Line 132:   // Total number of files from the effective partitions (partitions_ 
or sample_).
update comment


Line 135:   // Total number of bytes from the effective partitions (partitions_ 
or sample_).
same


Line 678:           cardinality_ = addCardinalities(cardinality_, 
p.getNumRows());
shouldn't the cardinality reflect the reduction from sampling? if you're 
sampling, you reduce scan output, which should affect join order, no?


http://gerrit.cloudera.org:8080/#/c/6868/3/tests/query_test/test_tablesample.py
File tests/query_test/test_tablesample.py:

Line 47:     # 2. The results of queries without a repeatable clause could be 
change due to
remove "be"


-- 
To view, visit http://gerrit.cloudera.org:8080/6868
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Ief112cfb1e4983c5d94c08696dc83da9ccf43f70
Gerrit-PatchSet: 3
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Alex Behm <alex.b...@cloudera.com>
Gerrit-Reviewer: Alex Behm <alex.b...@cloudera.com>
Gerrit-Reviewer: Marcel Kornacker <mar...@cloudera.com>
Gerrit-HasComments: Yes

Reply via email to