Aman Sinha has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/16723 )

Change subject: IMPALA-10314: Optimize planning time for simple limits
......................................................................


Patch Set 11:

(4 comments)

> Patch Set 11:
>
> (4 comments)
>
> Just a couple corner cases I have run into; given this is an opt-in 
> optimization now it might not be incorrect to ignore these.
>
> I think it's good to think about the case where this optimization helps and 
> not risk an incorrect limit in other cases. Where this helps most.
> a. lots of files
> b. small limits
>
> a) the scan range and scheduling overhead is only slow when there are many 
> hosts + files.
>
> b) for large limits maybe the bulk of query run time goes to fetching results 
> and not the planning, but that said it may not hurt too much in this case.

Thanks for the comments. I will create a follow-up JIRA to address couple of 
these comments considering that this CR was +2 ed.
Note that the more generalized issue of optimizing for limits is something that 
Tim and I had some offline discussion about and he created 'IMPALA-10347: 
Explore approaches to optimizing queries that will likely be short-circuited by 
limits'

http://gerrit.cloudera.org:8080/#/c/16723/11/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
File fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java:

http://gerrit.cloudera.org:8080/#/c/16723/11/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@854
PS11, Line 854:       FileSystem partitionFs;
> I'd consider only doing this optimization for "record oriented" or "splitta
It simplifies the logic and our internal testing at least if we could apply it 
across the board..perhaps for text format we have an allowance that 10% more 
files be considered to accommodate 'invalid' files .. will that be acceptable ?


http://gerrit.cloudera.org:8080/#/c/16723/11/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@867
PS11, Line 867:           for (FileDescriptor fd : 
partition.getFileDescriptors()) {
> Also maybe a threshold on the number of scan ranges where we wouldn't bothe
This is related to your other comment below about bailing out under certain 
conditions. I'll try to run the benchmark .. that's a good idea.


http://gerrit.cloudera.org:8080/#/c/16723/11/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@873
PS11, Line 873:             simpleLimitNumRows++;  // conservatively estimate 1 
row per file
> The flip side of this might be for simple limits that are relatively large
One complication is that the total number of files is not known up front 
(unless we aggregate it up front). We are pruning at 2 levels: once in the 
HdfsPartitionPruner where we limit the number of partitions and then here where 
we limit the number of files per partition.  In both places, as each partition 
is processed, we look at the # files but don't know the total.  We could decide 
to do the aggregation of the num files by making a separate pass over all 
partitions during HdfsPartitionPruner but that will add some overhead.


http://gerrit.cloudera.org:8080/#/c/16723/11/fe/src/test/java/org/apache/impala/planner/PlannerTest.java
File fe/src/test/java/org/apache/impala/planner/PlannerTest.java:

http://gerrit.cloudera.org:8080/#/c/16723/11/fe/src/test/java/org/apache/impala/planner/PlannerTest.java@1121
PS11, Line 1121:
> For Hive ACID and deleted records would this logic still work? Might be a h
For ACID tables with deleted rows, the planner will internally create an Hash 
Anti Join to handle the not-in, so yeah the limit should not be applied in such 
cases because it is no longer a simple scan.  I will create a separate JIRA to 
handle that case since additional testing and code changes would be needed.
Thanks for raising this.



--
To view, visit http://gerrit.cloudera.org:8080/16723
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I9d6a79263bc092e0f3e9a1d72da5618f3cc35574
Gerrit-Change-Number: 16723
Gerrit-PatchSet: 11
Gerrit-Owner: Aman Sinha <[email protected]>
Gerrit-Reviewer: Aman Sinha <[email protected]>
Gerrit-Reviewer: Impala Public Jenkins <[email protected]>
Gerrit-Reviewer: Qifan Chen <[email protected]>
Gerrit-Reviewer: Shant Hovsepian <[email protected]>
Gerrit-Reviewer: Tim Armstrong <[email protected]>
Gerrit-Comment-Date: Tue, 24 Nov 2020 02:06:28 +0000
Gerrit-HasComments: Yes

Reply via email to