Aman Sinha has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/16792 )

Change subject: IMPALA-10360: Allow simple limit to be treated as sampling hint
......................................................................


Patch Set 3:

(2 comments)

http://gerrit.cloudera.org:8080/#/c/16792/2/fe/src/main/java/org/apache/impala/analysis/SelectStmt.java
File fe/src/main/java/org/apache/impala/analysis/SelectStmt.java:

http://gerrit.cloudera.org:8080/#/c/16792/2/fe/src/main/java/org/apache/impala/analysis/SelectStmt.java@223
PS2, Line 223: if (getTableRefs().size() == 1)
> By looking at the following view DDL, I have the impression that the conver
Yes, the convert_limit_to_sample hint is per table only.  The expectation is 
that a user will typically want to apply it to the fact table but not to the 
dimension tables.


http://gerrit.cloudera.org:8080/#/c/16792/2/fe/src/main/java/org/apache/impala/planner/HdfsPartitionPruner.java
File fe/src/main/java/org/apache/impala/planner/HdfsPartitionPruner.java:

http://gerrit.cloudera.org:8080/#/c/16792/2/fe/src/main/java/org/apache/impala/planner/HdfsPartitionPruner.java@209
PS2, Line 209: estimatedTotalRows
> Okay. It seems getTable().getNumRows() returns the raw row count as recorde
The TABLESAMPLE percent is a long type, so yeah the minimum can be 1%.  You're 
right that the sampling is getting applied after partition pruning but I just 
want to make clear that there are 2 types of partition pruning: (a) based on 
predicates on the partition columns and (b) based on the simple limit.  When 
this method is called, (a) has already been applied.  If the sampling hint is 
provided we don't do the pruning for (b) at all.  We will use the supplied list 
of partitions and sample across all those partitions.

Our docs 
(https://impala.apache.org/docs/build/html/topics/impala_tablesample.html) say 
this:
====
Partitioning:
When you query a partitioned table, any partition pruning happens before Impala 
selects the data files to sample. For example, in a table partitioned by year, 
a query with WHERE year = 2017 and a TABLESAMPLE SYSTEM(10) clause would sample 
data files representing at least 10% of the bytes present in the 2017 partition.
====

The expectation of the user is that if they have supplied a sample percent, 
just use that against the final pruned partitions rather than inflating the 
percent.  I could make the ratio better by considering a heuristic of uniform 
distribution across partitions and scaling down the total row count in the 
denominator by multiplying it with num_pruned_partitions/num_total_partitions.  
I want to avoid having to add up all the partitions' row counts.
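
To make the heuristic above concrete, here is a rough sketch of the arithmetic. 
It is not the code in the patch: the class and method names are hypothetical, 
num_pruned_partitions is read as the number of partitions that survive 
predicate-based pruning, and rows are assumed to be spread uniformly across 
partitions.

```java
// Illustrative sketch only; these names do not exist in Impala.
public final class SamplePercentSketch {
  private SamplePercentSketch() {}

  // Scales the table-level row count by the fraction of partitions that
  // remain after predicate-based pruning (uniform distribution assumed).
  static long estimatePrunedRows(long totalRows, int numPrunedPartitions,
      int numTotalPartitions) {
    if (numTotalPartitions <= 0) return totalRows;
    // Use double arithmetic so that totalRows * numPrunedPartitions
    // cannot overflow a long.
    double scaled = (double) totalRows * numPrunedPartitions / numTotalPartitions;
    return Math.max(1L, Math.round(scaled));
  }

  // Converts a simple limit into a whole-number sample percent, clamped
  // to [1, 100] because the percent is an integral type.
  static long samplePercent(long limit, long estimatedRows) {
    if (estimatedRows <= 0) return 100L;
    long pct = (long) Math.ceil(100.0 * limit / estimatedRows);
    return Math.min(100L, Math.max(1L, pct));
  }
}
```

With a 10B row table and 12 of 120 partitions remaining, the scaled 
denominator is 1B rows, and a limit of 100 still lands on the 1% floor.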

All this is based on the row count... the alternative is the other option I 
mentioned before, with the percent specified in the hint, which makes it 
explicit.  But I think in the vast majority of cases, since the simple limit is 
small (10-100), having a minimum of 1% for a fact table even after partition 
pruning is going to be sufficient.  In fact, it would have been useful to be 
able to sample a fractional percentage, e.g. 0.01% of a 10B row table.
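
For a sense of scale, a quick back-of-the-envelope calculation. This is only a 
sketch: the helper below is hypothetical, and it assumes the number of rows 
sampled is proportional to the sample percent, whereas TABLESAMPLE actually 
selects data files by bytes.

```java
// Illustrative arithmetic only; not an Impala API.
public final class SampleSizeSketch {
  private SampleSizeSketch() {}

  // Approximate number of rows covered by sampling `percent` of a table,
  // assuming rows are proportional to the sampled fraction.
  static long rowsSampled(long totalRows, double percent) {
    return Math.round(totalRows * (percent / 100.0));
  }
}
```

Under that assumption, the 1% floor on a 10B row table covers about 100M rows, 
while a 0.01% sample would cover about 1M rows, both far more than a limit of 
10-100 needs.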



--
To view, visit http://gerrit.cloudera.org:8080/16792
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ife05a5343c913006f7659949b327b63d3f10c04b
Gerrit-Change-Number: 16792
Gerrit-PatchSet: 3
Gerrit-Owner: Aman Sinha <[email protected]>
Gerrit-Reviewer: Aman Sinha <[email protected]>
Gerrit-Reviewer: Impala Public Jenkins <[email protected]>
Gerrit-Reviewer: Qifan Chen <[email protected]>
Gerrit-Reviewer: Shant Hovsepian <[email protected]>
Gerrit-Reviewer: Tim Armstrong <[email protected]>
Gerrit-Comment-Date: Thu, 03 Dec 2020 18:23:53 +0000
Gerrit-HasComments: Yes
