Aman Sinha has posted comments on this change. ( http://gerrit.cloudera.org:8080/16792 )
Change subject: IMPALA-10360: Allow simple limit to be treated as sampling hint ...................................................................... Patch Set 5: (3 comments) http://gerrit.cloudera.org:8080/#/c/16792/5/fe/src/main/java/org/apache/impala/analysis/SelectStmt.java File fe/src/main/java/org/apache/impala/analysis/SelectStmt.java: http://gerrit.cloudera.org:8080/#/c/16792/5/fe/src/main/java/org/apache/impala/analysis/SelectStmt.java@223 PS5, Line 223: if (getTableRefs().size() == 1) return true; > Should we remove this? It seems hasConvertLimitToSampleHint() can return tr We cannot remove this because as I mentioned in a previous comment (patchset 2) the table level hint is not required in order for simple limit optimization to be applied. For example, there are 2 cases: 1. select * from (select * from t where [always_true] a > 0) limit 10; 2. select * from (select * from t [convert_limit_to_sample] where [always_true] a > 0) limit 10; In both cases, we want to be able to apply the optimization. In case 1, it will just pick first 10 files while in case 2 it will sample across multiple partitions. Case 1 will typically be much faster planning time, so we should support that. http://gerrit.cloudera.org:8080/#/c/16792/2/fe/src/main/java/org/apache/impala/planner/HdfsPartitionPruner.java File fe/src/main/java/org/apache/impala/planner/HdfsPartitionPruner.java: http://gerrit.cloudera.org:8080/#/c/16792/2/fe/src/main/java/org/apache/impala/planner/HdfsPartitionPruner.java@209 PS2, Line 209: estimatedTotalRows > Sounds about right. I haven't looked into why the past decision was to only support whole numbers for the sampling but probably the use case wasn't there to motivate supporting fractional values. You may want to look into the history but yeah as I said smaller sample size would be useful in this situation. http://gerrit.cloudera.org:8080/#/c/16792/5/fe/src/main/java/org/apache/impala/planner/HdfsPartitionPruner.java File fe/src/main/java/org/apache/impala/planner/HdfsPartitionPruner.java: http://gerrit.cloudera.org:8080/#/c/16792/5/fe/src/main/java/org/apache/impala/planner/HdfsPartitionPruner.java@217 PS5, Line 217: partitions.size()/numTotalPartitions > Cool! This will work well when the partitions are about the same size, whic Yes, there is an assumption of uniform distribution (I should add a comment) as this is still a heuristic. I would like to avoid the per partition numRows estimate since that can be way off and in the past Tim has also discouraged using it. The total estimated row count can be completely off as well, so I acknowledge there's weakness here. Even if it was accurate, I also didn't want to add a for loop to add up the numRows of the surviving partitions since that could potentially run into tens of thousands or hundreds of thousands (especially if no pruning happens which is quite common). I am beginning to think the only foolproof way is to let the user specify exact percentage in the hint. e.g [convert_limit_to_sample=5]. This guarantees a 5% sampling of surviving partitions if limit is present and does not rely on stats etc. What do you guys think ? I will have to add a bit of parsing logic to the hint processing. -- To view, visit http://gerrit.cloudera.org:8080/16792 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Ife05a5343c913006f7659949b327b63d3f10c04b Gerrit-Change-Number: 16792 Gerrit-PatchSet: 5 Gerrit-Owner: Aman Sinha <[email protected]> Gerrit-Reviewer: Aman Sinha <[email protected]> Gerrit-Reviewer: Impala Public Jenkins <[email protected]> Gerrit-Reviewer: Qifan Chen <[email protected]> Gerrit-Reviewer: Shant Hovsepian <[email protected]> Gerrit-Reviewer: Tim Armstrong <[email protected]> Gerrit-Comment-Date: Fri, 04 Dec 2020 18:57:36 +0000 Gerrit-HasComments: Yes
