Aman Sinha has posted comments on this change. ( http://gerrit.cloudera.org:8080/16723 )
Change subject: IMPALA-10314: Optimize planning time for simple limits ...................................................................... Patch Set 3: (3 comments) http://gerrit.cloudera.org:8080/#/c/16723/3/common/thrift/ImpalaService.thrift File common/thrift/ImpalaService.thrift: http://gerrit.cloudera.org:8080/#/c/16723/3/common/thrift/ImpalaService.thrift@601 PS3, Line 601: (1 row per file) > Sounds like for parquet/orc, we can take a sample of few data files and get Yeah, I had initially thought of that but the file metadata as it exists today see [1] is not customized for Parquet or ORC so it does not keep track of the rowcount per file (we do have numRows_ per partition but that's an estimate obtained from hive metastore). We could certainly do more aggressive pruning during planning time if we leverage that. This could be a future enhancement (maybe there's an existing jira.. I will check). [1] https://github.com/apache/impala/blob/master/fe/src/main/java/org/apache/impala/catalog/HdfsPartition.java#L103 http://gerrit.cloudera.org:8080/#/c/16723/3/fe/src/main/java/org/apache/impala/analysis/SelectStmt.java File fe/src/main/java/org/apache/impala/analysis/SelectStmt.java: http://gerrit.cloudera.org:8080/#/c/16723/3/fe/src/main/java/org/apache/impala/analysis/SelectStmt.java@180 PS3, Line 180: order-by > Sounds like we should allow the order by clause since it does not increase/ Not sure what you meant here. In the presence of order-by, we cannot push the limit to the scan anyways. http://gerrit.cloudera.org:8080/#/c/16723/3/fe/src/main/java/org/apache/impala/planner/HdfsPartitionPruner.java File fe/src/main/java/org/apache/impala/planner/HdfsPartitionPruner.java: http://gerrit.cloudera.org:8080/#/c/16723/3/fe/src/main/java/org/apache/impala/planner/HdfsPartitionPruner.java@209 PS3, Line 209: for (FeFsPartition p : partitions) { : numRows += p.getNumFileDescriptors(); : prunedPartitions.add(p); : if (numRows >= analyzer.getSimpleLimitStatus().second) { : // here we only limit the partitions; later in HdfsScanNode we will : // limit the file descriptors within a partition : break; : } > I wonder if the prunnedPartitions returned here are the only ones to be sca Yes, these are the pruned partitions that will be sent over to the runtime. Regarding the WHERE clause, Tim had raised similar concern and we had some offline discussion about how to safely allow predicates. The reason for allowing at least some type of predicate is that for various reasons the user may be doing a limit on top of a view that has a predicate that is always true. For example, https://issues.apache.org/jira/browse/IMPALA-10064 describes one such use case we have seen. Tim had a suggestion for using a hint for the where clause (where /* +always_true */) and I agree that it is a safe option since it puts the onus on the user. I will update the patch with those changes. For randomly picking a subset of partitions..hmm..my initial thought is it may be over-doing the optimization. Let's see the adoption of this functionality in a concurrent workload scenario..note that the optimize_simple_limit is disabled by default. -- To view, visit http://gerrit.cloudera.org:8080/16723 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I9d6a79263bc092e0f3e9a1d72da5618f3cc35574 Gerrit-Change-Number: 16723 Gerrit-PatchSet: 3 Gerrit-Owner: Aman Sinha <[email protected]> Gerrit-Reviewer: Aman Sinha <[email protected]> Gerrit-Reviewer: Impala Public Jenkins <[email protected]> Gerrit-Reviewer: Qifan Chen <[email protected]> Gerrit-Reviewer: Shant Hovsepian <[email protected]> Gerrit-Reviewer: Tim Armstrong <[email protected]> Gerrit-Comment-Date: Tue, 17 Nov 2020 19:41:20 +0000 Gerrit-HasComments: Yes
