Alex Behm has posted comments on this change. Change subject: IMPALA-5309: Adds TABLESAMPLE clause for HDFS table refs. ......................................................................
Patch Set 3: (7 comments) http://gerrit.cloudera.org:8080/#/c/6868/3/fe/src/main/java/org/apache/impala/analysis/TableSampleClause.java File fe/src/main/java/org/apache/impala/analysis/TableSampleClause.java: Line 68: if (randomSeed_ != null) { > single line? Done http://gerrit.cloudera.org:8080/#/c/6868/3/fe/src/main/java/org/apache/impala/catalog/HdfsPartition.java File fe/src/main/java/org/apache/impala/catalog/HdfsPartition.java: Line 812: result.numRows_ = numRows_; > this is a bit weird outside of the context of sampling (why should #rows st Might look weird at first glance, but I don't think it is because the #rows statistic does not necessarily correspond to the current file descriptors especially with sampling. I could not find a reasonable way to integrate the sampling into computeScanRangeLocations(). Details discussed in another comment thread. http://gerrit.cloudera.org:8080/#/c/6868/3/fe/src/main/java/org/apache/impala/catalog/HdfsTable.java File fe/src/main/java/org/apache/impala/catalog/HdfsTable.java: Line 1961: List<Pair<Long, FileDescriptor>> allFiles = > I think it overall makes more sense to integrate the sampling into HdfsScan I tried integrating the sampling into computeScanRangeLocations() but ran into the same question of how to avoid selecting the same file multiple times (without retry loop). I don't really see how we efficiently solve this problem without a solution that looks similar to what is already here. I changed this code function to be more efficient, but fundamentally the approach is the same. This function now returns a map like you suggested, but I'm not really sure that is cleaner than the partition cloning approach. http://gerrit.cloudera.org:8080/#/c/6868/3/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java File fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java: Line 132: // Total number of files from the effective partitions (partitions_ or sample_). > update comment Done Line 135: // Total number of bytes from the effective partitions (partitions_ or sample_). > same Done Line 678: cardinality_ = addCardinalities(cardinality_, p.getNumRows()); > shouldn't the cardinality reflect the reduction from sampling? if you're sa The cardinality is adjusted in L694. In a similar place in the new patch. http://gerrit.cloudera.org:8080/#/c/6868/3/tests/query_test/test_tablesample.py File tests/query_test/test_tablesample.py: Line 47: # 2. The results of queries without a repeatable clause could be change due to > remove "be" Done -- To view, visit http://gerrit.cloudera.org:8080/6868 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: Ief112cfb1e4983c5d94c08696dc83da9ccf43f70 Gerrit-PatchSet: 3 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Alex Behm <[email protected]> Gerrit-Reviewer: Alex Behm <[email protected]> Gerrit-Reviewer: Marcel Kornacker <[email protected]> Gerrit-HasComments: Yes
