Qifan Chen has posted comments on this change. ( http://gerrit.cloudera.org:8080/16792 )
Change subject: IMPALA-10360: Allow simple limit to be treated as sampling hint
......................................................................


Patch Set 5:

(3 comments)

Looks good to me!

http://gerrit.cloudera.org:8080/#/c/16792/5/fe/src/main/java/org/apache/impala/analysis/SelectStmt.java
File fe/src/main/java/org/apache/impala/analysis/SelectStmt.java:

http://gerrit.cloudera.org:8080/#/c/16792/5/fe/src/main/java/org/apache/impala/analysis/SelectStmt.java@223
PS5, Line 223:       if (getTableRefs().size() == 1) return true;
Should we remove this? It seems hasConvertLimitToSampleHint() can return true or false depending on whether the hint has been set on the only table ref here, and it may not be set.


http://gerrit.cloudera.org:8080/#/c/16792/2/fe/src/main/java/org/apache/impala/planner/HdfsPartitionPruner.java
File fe/src/main/java/org/apache/impala/planner/HdfsPartitionPruner.java:

http://gerrit.cloudera.org:8080/#/c/16792/2/fe/src/main/java/org/apache/impala/planner/HdfsPartitionPruner.java@209
PS2, Line 209: estimatedTotalRows
> I made this change to use a scaled down value of the estimated row count (
Sounds about right.

I also like the idea of specifying the sample size as a number of rows. That would speed up sampling a few rows from a very large table, where 1% could be on the order of millions of rows. I can file a JIRA for this and work on it after the min/max work.


http://gerrit.cloudera.org:8080/#/c/16792/5/fe/src/main/java/org/apache/impala/planner/HdfsPartitionPruner.java
File fe/src/main/java/org/apache/impala/planner/HdfsPartitionPruner.java:

http://gerrit.cloudera.org:8080/#/c/16792/5/fe/src/main/java/org/apache/impala/planner/HdfsPartitionPruner.java@217
PS5, Line 217: partitions.size()/numTotalPartitions
Cool! This will work well when the partitions are about the same size, which is mostly true with hash partitions.
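To make the partition-ratio scaling in the comment above concrete, here is a minimal sketch. It is not the code from the patch: the class and method names (PartitionScaledSample, scaledRowEstimate, sampleFraction) are hypothetical, and it assumes the sampling fraction is derived as limit / estimated rows, which the review thread does not spell out.

```java
public class PartitionScaledSample {
    // Hypothetical helper (not from the patch): scale a table-level row
    // estimate by the fraction of partitions that survive pruning.
    // This assumes every partition holds roughly the same number of rows.
    static double scaledRowEstimate(long estimatedTotalRows,
                                    int survivingPartitions,
                                    int numTotalPartitions) {
        if (numTotalPartitions <= 0) return 0.0;
        // Promote to double before dividing; int / int would truncate in Java.
        return (double) estimatedTotalRows * survivingPartitions / numTotalPartitions;
    }

    // Hypothetical helper: fraction of rows to sample so that roughly
    // `limit` rows come back, clamped to at most 1.0 (scan everything).
    static double sampleFraction(long limit, double estimatedRows) {
        if (estimatedRows <= 0) return 1.0;
        return Math.min(1.0, limit / estimatedRows);
    }

    public static void main(String[] args) {
        // 10 equally sized partitions, 2 survive pruning, LIMIT 1000.
        double est = scaledRowEstimate(1_000_000L, 2, 10);
        System.out.println("estimated rows  = " + est);
        System.out.println("sample fraction = " + sampleFraction(1000, est));
    }
}
```

Under the same assumptions, skew is where this breaks down: if a table of 1,000,000 rows has 3 partitions and the single surviving partition actually holds 900,000 rows, the partition ratio estimates about 333,333 rows and yields a fraction near 0.003, while the true row count would call for roughly 0.0011.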
For other partition schemes with unequal sizes, such as range partitioning, I wonder if the use of HdfsPartition::numRows_ would work: sample rate = (# rows to return) / (# rows in the surviving partitions).


-- 
To view, visit http://gerrit.cloudera.org:8080/16792
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ife05a5343c913006f7659949b327b63d3f10c04b
Gerrit-Change-Number: 16792
Gerrit-PatchSet: 5
Gerrit-Owner: Aman Sinha
Gerrit-Reviewer: Aman Sinha
Gerrit-Reviewer: Impala Public Jenkins
Gerrit-Reviewer: Qifan Chen
Gerrit-Reviewer: Shant Hovsepian
Gerrit-Reviewer: Tim Armstrong
Gerrit-Comment-Date: Fri, 04 Dec 2020 14:34:48 +0000
Gerrit-HasComments: Yes
