[Impala-ASF-CR] IMPALA-10314: Optimize planning time for simple limits

Aman Sinha (Code Review) Tue, 17 Nov 2020 11:41:28 -0800

Aman Sinha has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/16723 )


Change subject: IMPALA-10314: Optimize planning time for simple limits
......................................................................


Patch Set 3:

(3 comments)

http://gerrit.cloudera.org:8080/#/c/16723/3/common/thrift/ImpalaService.thrift
File common/thrift/ImpalaService.thrift:

http://gerrit.cloudera.org:8080/#/c/16723/3/common/thrift/ImpalaService.thrift@601
PS3, Line 601: (1 row per file)
> Sounds like for parquet/orc, we can take a sample of few data files and get
Yeah, I had initially thought of that but the file metadata as it exists today  
see [1] is not customized for Parquet or ORC so it does not keep track of the 
rowcount per file (we do have  numRows_ per partition but that's an estimate 
obtained from hive metastore). We could certainly do more aggressive pruning 
during planning time if we leverage that.  This could be a future enhancement 
(maybe there's an existing jira.. I will check).

[1] 
https://github.com/apache/impala/blob/master/fe/src/main/java/org/apache/impala/catalog/HdfsPartition.java#L103


http://gerrit.cloudera.org:8080/#/c/16723/3/fe/src/main/java/org/apache/impala/analysis/SelectStmt.java
File fe/src/main/java/org/apache/impala/analysis/SelectStmt.java:

http://gerrit.cloudera.org:8080/#/c/16723/3/fe/src/main/java/org/apache/impala/analysis/SelectStmt.java@180
PS3, Line 180: order-by
> Sounds like we should allow the order by clause since it does not increase/
Not sure what you meant here.  In the presence of order-by, we cannot push the 
limit to the scan anyways.


http://gerrit.cloudera.org:8080/#/c/16723/3/fe/src/main/java/org/apache/impala/planner/HdfsPartitionPruner.java
File fe/src/main/java/org/apache/impala/planner/HdfsPartitionPruner.java:

http://gerrit.cloudera.org:8080/#/c/16723/3/fe/src/main/java/org/apache/impala/planner/HdfsPartitionPruner.java@209
PS3, Line 209: for (FeFsPartition p : partitions) {
             :         numRows += p.getNumFileDescriptors();
             :         prunedPartitions.add(p);
             :         if (numRows >= analyzer.getSimpleLimitStatus().second) {
             :           // here we only limit the partitions; later in 
HdfsScanNode we will
             :           // limit the file descriptors within a partition
             :           break;
             :         }
> I wonder if the prunnedPartitions returned here are the only ones to be sca
Yes, these are the pruned partitions that will be sent over to the runtime. 
Regarding the WHERE clause,  Tim had raised similar concern and we had some 
offline discussion about how to safely allow predicates.  The reason for 
allowing at least some type of predicate is that for various reasons the user 
may be doing a limit on top of a view that has a predicate that is always true. 
 For example,  https://issues.apache.org/jira/browse/IMPALA-10064 describes one 
such use case we have seen.

Tim had a suggestion for using a hint for the where clause (where /* 
+always_true */) and I agree that it is a safe option since it puts the onus on 
the user.  I will update the patch with those changes.

For randomly picking a subset of partitions..hmm..my initial thought is it may 
be over-doing the optimization. Let's see the adoption of this functionality in 
a concurrent workload scenario..note that the optimize_simple_limit is disabled 
by default.



--
To view, visit http://gerrit.cloudera.org:8080/16723
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I9d6a79263bc092e0f3e9a1d72da5618f3cc35574
Gerrit-Change-Number: 16723
Gerrit-PatchSet: 3
Gerrit-Owner: Aman Sinha <[email protected]>
Gerrit-Reviewer: Aman Sinha <[email protected]>
Gerrit-Reviewer: Impala Public Jenkins <[email protected]>
Gerrit-Reviewer: Qifan Chen <[email protected]>
Gerrit-Reviewer: Shant Hovsepian <[email protected]>
Gerrit-Reviewer: Tim Armstrong <[email protected]>
Gerrit-Comment-Date: Tue, 17 Nov 2020 19:41:20 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-10314: Optimize planning time for simple limits

Reply via email to