Aman Sinha has posted comments on this change. ( http://gerrit.cloudera.org:8080/16723 )
Change subject: IMPALA-10314: Optimize planning time for simple limits ...................................................................... Patch Set 11: (4 comments) > Patch Set 11: > > (4 comments) > > Just a couple corner cases I have run into; given this is an opt-in > optimization now it might not be incorrect to ignore these. > > I think it's good to think about the case where this optimization helps and > not risk an incorrect limit in other cases. Where this helps most. > a. lots of files > b. small limits > > a) the scan range and scheduling overhead is only slow when there are many > hosts + files. > > b) for large limits maybe the bulk of query run time goes to fetching results > and not the planning, but that said it may not hurt too much in this case. Thanks for the comments. I will create a follow-up JIRA to address couple of these comments considering that this CR was +2 ed. Note that the more generalized issue of optimizing for limits is something that Tim and I had some offline discussion about and he created 'IMPALA-10347: Explore approaches to optimizing queries that will likely be short-circuited by limits' http://gerrit.cloudera.org:8080/#/c/16723/11/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java File fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java: http://gerrit.cloudera.org:8080/#/c/16723/11/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@854 PS11, Line 854: FileSystem partitionFs; > I'd consider only doing this optimization for "record oriented" or "splitta It simplifies the logic and our internal testing at least if we could apply it across the board..perhaps for text format we have an allowance that 10% more files be considered to accommodate 'invalid' files .. will that be acceptable ? http://gerrit.cloudera.org:8080/#/c/16723/11/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@867 PS11, Line 867: for (FileDescriptor fd : partition.getFileDescriptors()) { > Also maybe a threshold on the number of scan ranges where we wouldn't bothe This is related to your other comment below about bailing out under certain conditions. I'll try to run the benchmark .. that's a good idea. http://gerrit.cloudera.org:8080/#/c/16723/11/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@873 PS11, Line 873: simpleLimitNumRows++; // conservatively estimate 1 row per file > The flip side of this might be for simple limits that are relatively large One complication is that the total number of files is not known up front (unless we aggregate it up front). We are pruning at 2 levels: once in the HdfsPartitionPruner where we limit the number of partitions and then here where we limit the number of files per partition. In both places, as each partition is processed, we look at the # files but don't know the total. We could decide to do the aggregation of the num files by making a separate pass over all partitions during HdfsPartitionPruner but that will add some overhead. http://gerrit.cloudera.org:8080/#/c/16723/11/fe/src/test/java/org/apache/impala/planner/PlannerTest.java File fe/src/test/java/org/apache/impala/planner/PlannerTest.java: http://gerrit.cloudera.org:8080/#/c/16723/11/fe/src/test/java/org/apache/impala/planner/PlannerTest.java@1121 PS11, Line 1121: > For Hive ACID and deleted records would this logic still work? Might be a h For ACID tables with deleted rows, the planner will internally create an Hash Anti Join to handle the not-in, so yeah the limit should not be applied in such cases because it is no longer a simple scan. I will create a separate JIRA to handle that case since additional testing and code changes would be needed. Thanks for raising this. -- To view, visit http://gerrit.cloudera.org:8080/16723 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I9d6a79263bc092e0f3e9a1d72da5618f3cc35574 Gerrit-Change-Number: 16723 Gerrit-PatchSet: 11 Gerrit-Owner: Aman Sinha <[email protected]> Gerrit-Reviewer: Aman Sinha <[email protected]> Gerrit-Reviewer: Impala Public Jenkins <[email protected]> Gerrit-Reviewer: Qifan Chen <[email protected]> Gerrit-Reviewer: Shant Hovsepian <[email protected]> Gerrit-Reviewer: Tim Armstrong <[email protected]> Gerrit-Comment-Date: Tue, 24 Nov 2020 02:06:28 +0000 Gerrit-HasComments: Yes
