Aman Sinha has posted comments on this change. ( http://gerrit.cloudera.org:8080/16792 )
Change subject: IMPALA-10360: Allow simple limit to be treated as sampling hint
......................................................................

Patch Set 3:

(2 comments)

http://gerrit.cloudera.org:8080/#/c/16792/2/fe/src/main/java/org/apache/impala/analysis/SelectStmt.java
File fe/src/main/java/org/apache/impala/analysis/SelectStmt.java:

http://gerrit.cloudera.org:8080/#/c/16792/2/fe/src/main/java/org/apache/impala/analysis/SelectStmt.java@223
PS2, Line 223:       if (getTableRefs().size() == 1)
> By looking at the following view DDL, I have the impression that the conver
Yes, the convert_limit_to_sample hint is per table only. The expectation is that a user will typically want to apply it to the fact table but not to the dimension table.

http://gerrit.cloudera.org:8080/#/c/16792/2/fe/src/main/java/org/apache/impala/planner/HdfsPartitionPruner.java
File fe/src/main/java/org/apache/impala/planner/HdfsPartitionPruner.java:

http://gerrit.cloudera.org:8080/#/c/16792/2/fe/src/main/java/org/apache/impala/planner/HdfsPartitionPruner.java@209
PS2, Line 209:       estimatedTotalRows
> Okay. It seems getTable().getNumRows() returns the raw row count as recorde
The TABLESAMPLE percent is a long type, so yes, the minimum can be 1%.

You're right that the sampling is applied after partition pruning, but I just want to make clear that there are 2 types of partition pruning: (a) based on predicates on the partition column and (b) based on the simple limit. When this method is called, (a) has already been applied. If the sampling hint is provided, we don't do the pruning for (b) at all. We use the supplied list of partitions and sample across all of those partitions.

Our docs (https://impala.apache.org/docs/build/html/topics/impala_tablesample.html) say this:
====
Partitioning: When you query a partitioned table, any partition pruning happens before Impala selects the data files to sample. For example, in a table partitioned by year, a query with WHERE year = 2017 and a TABLESAMPLE SYSTEM(10) clause would sample data files representing at least 10% of the bytes present in the 2017 partition.
====

The user's expectation is that if they have supplied a sample percent, we just use that against the final pruned partitions rather than inflating the percent.

I could make the ratio better by assuming a uniform distribution of rows across partitions and scaling down the total row count in the denominator by multiplying it with num_pruned_partitions/num_total_partitions. I want to avoid having to add up all the partitions' row counts.

All of this is based on the row count. The alternative is the other option I mentioned before: having the percent specified in the hint, which makes it explicit. But I think that in the vast majority of cases, since the simple limit is small (10-100), having a minimum of 1% for a fact table even after partition pruning is going to be sufficient. In fact, it would have been useful to be able to sample a fractional percentage, e.g. 0.01% of a 10B row table.

-- 
To view, visit http://gerrit.cloudera.org:8080/16792
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ife05a5343c913006f7659949b327b63d3f10c04b
Gerrit-Change-Number: 16792
Gerrit-PatchSet: 3
Gerrit-Owner: Aman Sinha <[email protected]>
Gerrit-Reviewer: Aman Sinha <[email protected]>
Gerrit-Reviewer: Impala Public Jenkins <[email protected]>
Gerrit-Reviewer: Qifan Chen <[email protected]>
Gerrit-Reviewer: Shant Hovsepian <[email protected]>
Gerrit-Reviewer: Tim Armstrong <[email protected]>
Gerrit-Comment-Date: Thu, 03 Dec 2020 18:23:53 +0000
Gerrit-HasComments: Yes
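
For reference, the row-count ratio and the uniform-distribution scaling discussed in the reply above can be sketched roughly as follows. This is only an illustration, not the code in the patch: the class and method names are hypothetical, and the rounding (ceiling) and the clamp to the range 1..100 are assumptions based on the percent being a whole number with a minimum of 1%.

```java
// Hypothetical sketch; none of these names exist in Impala.
final class LimitToSampleSketch {
  private LimitToSampleSketch() {}

  /**
   * Whole-number percent of rows needed to cover `limit` rows out of
   * `estimatedRows`, clamped to [1, 100] (assumed rounding and clamping).
   */
  static long samplePercent(long limit, long estimatedRows) {
    if (limit <= 0) {
      throw new IllegalArgumentException("limit must be positive");
    }
    // Assumption: without a usable row count, fall back to a full scan.
    if (estimatedRows <= 0) return 100;
    double pct = Math.ceil(100.0 * limit / estimatedRows);
    return (long) Math.max(1.0, Math.min(100.0, pct));
  }

  /**
   * Scales the table-level row count by the fraction of partitions that
   * survived predicate-based pruning, assuming rows are spread uniformly
   * across partitions. Avoids summing per-partition row counts.
   */
  static long scaledRowEstimate(long tableRows, int remainingPartitions,
      int totalPartitions) {
    if (totalPartitions <= 0 || remainingPartitions <= 0) return 0;
    return (long) Math.ceil(
        (double) tableRows * remainingPartitions / totalPartitions);
  }
}
```

With these assumptions, a limit of 100 against a 10B row table gives a ratio far below 1%, so the result clamps to 1. That is the case where a fractional percent would have been useful.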
