Marcel Kornacker has posted comments on this change. Change subject: IMPALA-5309: Adds TABLESAMPLE clause for HDFS table refs. ......................................................................
Patch Set 3: (7 comments) http://gerrit.cloudera.org:8080/#/c/6868/3/fe/src/main/java/org/apache/impala/analysis/TableSampleClause.java File fe/src/main/java/org/apache/impala/analysis/TableSampleClause.java: Line 68: if (randomSeed_ != null) { single line? http://gerrit.cloudera.org:8080/#/c/6868/3/fe/src/main/java/org/apache/impala/catalog/HdfsPartition.java File fe/src/main/java/org/apache/impala/catalog/HdfsPartition.java: Line 812: result.numRows_ = numRows_; this is a bit weird outside of the context of sampling (why should #rows stay the same for a different set of files?). instead of doing the sampling in the metadata, would it make more sense to do it during scan range computation? http://gerrit.cloudera.org:8080/#/c/6868/3/fe/src/main/java/org/apache/impala/catalog/HdfsTable.java File fe/src/main/java/org/apache/impala/catalog/HdfsTable.java: Line 1961: List<Pair<Long, FileDescriptor>> allFiles = this is an expensive list, and i don't think you need it. you can achieve the same effect with a list that mirrors orderedparts and contains the cumulative # of files up to the preceding partition (e.g., {0, 517, 1120, ...}). you then generate a random sequence of numbers between 0 and #totalfiles-1. this requires space proportional to the sample size, but not the total number of files. you could even return this as a map<partitionid, set<int>> (the value being the indices into the per-partition list of file descriptors), and have hdfsscannode use that to access the sampled files in the original partitions. http://gerrit.cloudera.org:8080/#/c/6868/3/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java File fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java: Line 132: // Total number of files from the effective partitions (partitions_ or sample_). update comment Line 135: // Total number of bytes from the effective partitions (partitions_ or sample_). same Line 678: cardinality_ = addCardinalities(cardinality_, p.getNumRows()); shouldn't the cardinality reflect the reduction from sampling? if you're sampling, you reduce scan output, which should affect join order, no? http://gerrit.cloudera.org:8080/#/c/6868/3/tests/query_test/test_tablesample.py File tests/query_test/test_tablesample.py: Line 47: # 2. The results of queries without a repeatable clause could be change due to remove "be" -- To view, visit http://gerrit.cloudera.org:8080/6868 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: Ief112cfb1e4983c5d94c08696dc83da9ccf43f70 Gerrit-PatchSet: 3 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Alex Behm <alex.b...@cloudera.com> Gerrit-Reviewer: Alex Behm <alex.b...@cloudera.com> Gerrit-Reviewer: Marcel Kornacker <mar...@cloudera.com> Gerrit-HasComments: Yes