Aman Sinha has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/16792 )

Change subject: IMPALA-10360: Allow simple limit to be treated as sampling hint
......................................................................


Patch Set 5:

(3 comments)

http://gerrit.cloudera.org:8080/#/c/16792/5/fe/src/main/java/org/apache/impala/analysis/SelectStmt.java
File fe/src/main/java/org/apache/impala/analysis/SelectStmt.java:

http://gerrit.cloudera.org:8080/#/c/16792/5/fe/src/main/java/org/apache/impala/analysis/SelectStmt.java@223
PS5, Line 223: if (getTableRefs().size() == 1) return true;
> Should we remove this? It seems hasConvertLimitToSampleHint() can return tr
We cannot remove this because as I mentioned in a previous comment (patchset 2) 
the table level hint is not required in order for simple limit optimization to 
be applied.   For example, there are 2 cases:
 1.   select * from (select * from  t where [always_true] a > 0) limit 10;
 2.   select * from (select * from t [convert_limit_to_sample] where 
[always_true] a > 0) limit 10;
In both cases, we want to be able to apply the optimization.  In case 1, it 
will just pick first 10 files while in case 2 it will sample across multiple 
partitions.   Case 1 will typically be much faster planning time, so we should 
support that.


http://gerrit.cloudera.org:8080/#/c/16792/2/fe/src/main/java/org/apache/impala/planner/HdfsPartitionPruner.java
File fe/src/main/java/org/apache/impala/planner/HdfsPartitionPruner.java:

http://gerrit.cloudera.org:8080/#/c/16792/2/fe/src/main/java/org/apache/impala/planner/HdfsPartitionPruner.java@209
PS2, Line 209: estimatedTotalRows
> Sounds about right.
I haven't looked into why the past decision was to only support whole numbers 
for the sampling but probably the use case wasn't there to motivate supporting 
fractional values. You may want to look into the history but yeah as I said 
smaller sample size would be useful in this situation.


http://gerrit.cloudera.org:8080/#/c/16792/5/fe/src/main/java/org/apache/impala/planner/HdfsPartitionPruner.java
File fe/src/main/java/org/apache/impala/planner/HdfsPartitionPruner.java:

http://gerrit.cloudera.org:8080/#/c/16792/5/fe/src/main/java/org/apache/impala/planner/HdfsPartitionPruner.java@217
PS5, Line 217: partitions.size()/numTotalPartitions
> Cool! This will work well when the partitions are about the same size, whic
Yes, there is an assumption of uniform distribution (I should add a comment) as 
this is still a heuristic. I would like to avoid the per partition numRows 
estimate since that can be way off and in the past Tim has also discouraged 
using it. The total estimated row count can be completely off as well, so I 
acknowledge there's weakness here.

Even if it was accurate, I also didn't want to add a for loop to add up the 
numRows of the surviving partitions since that could potentially run into tens 
of thousands or hundreds of thousands (especially if no pruning happens which 
is quite common).

I am beginning to think the only foolproof way is to let the user specify exact 
percentage in the hint.  e.g  [convert_limit_to_sample=5].  This guarantees a 
5% sampling of surviving partitions if limit is present and does not rely on 
stats etc.  What do you guys think ?  I will have to add a bit of parsing logic 
to the hint processing.



--
To view, visit http://gerrit.cloudera.org:8080/16792
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ife05a5343c913006f7659949b327b63d3f10c04b
Gerrit-Change-Number: 16792
Gerrit-PatchSet: 5
Gerrit-Owner: Aman Sinha <[email protected]>
Gerrit-Reviewer: Aman Sinha <[email protected]>
Gerrit-Reviewer: Impala Public Jenkins <[email protected]>
Gerrit-Reviewer: Qifan Chen <[email protected]>
Gerrit-Reviewer: Shant Hovsepian <[email protected]>
Gerrit-Reviewer: Tim Armstrong <[email protected]>
Gerrit-Comment-Date: Fri, 04 Dec 2020 18:57:36 +0000
Gerrit-HasComments: Yes

Reply via email to