Qifan Chen has posted comments on this change. ( http://gerrit.cloudera.org:8080/16792 )
Change subject: IMPALA-10360: Allow simple limit to be treated as sampling hint ...................................................................... Patch Set 3: (5 comments) Thanks a lot for the explanation. Two additional questions follow :-). http://gerrit.cloudera.org:8080/#/c/16792/2/fe/src/main/java/org/apache/impala/analysis/CompoundPredicate.java File fe/src/main/java/org/apache/impala/analysis/CompoundPredicate.java: http://gerrit.cloudera.org:8080/#/c/16792/2/fe/src/main/java/org/apache/impala/analysis/CompoundPredicate.java@79 PS2, Line 79: if ((op == Operator.AND && : (Expr.IS_ALWAYS_TRUE_PREDICATE.apply(e1) && : Expr.IS_ALWAYS_TRUE_P > Makes sense to handle OR for completeness. I have added it. One thing tha Done http://gerrit.cloudera.org:8080/#/c/16792/2/fe/src/main/java/org/apache/impala/analysis/Expr.java File fe/src/main/java/org/apache/impala/analysis/Expr.java: http://gerrit.cloudera.org:8080/#/c/16792/2/fe/src/main/java/org/apache/impala/analysis/Expr.java@520 PS2, Line 520: if (conjuncts.size() > 1) { > For the single conjunct case it should have already been set on the line 51 Done http://gerrit.cloudera.org:8080/#/c/16792/2/fe/src/main/java/org/apache/impala/analysis/SelectStmt.java File fe/src/main/java/org/apache/impala/analysis/SelectStmt.java: http://gerrit.cloudera.org:8080/#/c/16792/2/fe/src/main/java/org/apache/impala/analysis/SelectStmt.java@223 PS2, Line 223: if (getTableRefs().size() == 1) > Since this is a qualifying check to decide whether this query block is elig By looking at the following view DDL, I have the impression that the convert_limit_to_sample hint is only for table alltypes_date_partition_2, not for table alltypessmall. It is about right? ---- DATASET functional ---- BASE_TABLE_NAME alltypes_dp_2_view_2 ---- CREATE DROP VIEW IF EXISTS {db_name}{db_suffix}.{table_name}; -- view which references a table with hint and a WHERE clause with hint. -- WHERE clause has a compound predicate. CREATE VIEW {db_name}{db_suffix}.{table_name} AS SELECT * FROM {db_name}{db_suffix}.alltypes_date_partition_2 [convert_limit_to_sample] where [always_true] date_col = cast(timestamp_col as date) and int_col in (select bigint_col from functional.alltypessmall); ---- LOAD Looks like making it a table level hint helps provide extra level of safety. http://gerrit.cloudera.org:8080/#/c/16792/2/fe/src/main/java/org/apache/impala/planner/HdfsPartitionPruner.java File fe/src/main/java/org/apache/impala/planner/HdfsPartitionPruner.java: http://gerrit.cloudera.org:8080/#/c/16792/2/fe/src/main/java/org/apache/impala/planner/HdfsPartitionPruner.java@209 PS2, Line 209: estimatedTotalRows > Actually, the getTable().getNumRows() is the estimated row count. If stats Okay. It seems getTable().getNumRows() returns the raw row count as recorded in HMS for the table. In that case, your code ignores a table if the stats is missing or corrupt, which is good. 311 public void setTableStats(org.apache.hadoop.hive.metastore.api.Table msTbl) { 312 tableStats_ = new TTableStats(FeCatalogUtils.getRowCount(msTbl.getParameters())); 313 tableStats_.setTotal_file_bytes(FeCatalogUtils.getTotalSize(msTbl.getParameters())); 314 } 315 catalog/Table.java The current approach computes the sampling percentage automatically and has a lower bound of 1%. Since simple limit optimization reduces # of partitions to be scanned, I wonder if the sampling rate would still hold on the surviving partitions. For example, there are 4 partitions, p1, p2, p3, p4. The amount of data of the total in each partition: 20% (p1), 20% (p2), 50% (p3) and 10% (p4). Assume p4 is surviving. The for a limit that is close to #rows in p4, I assume we need almost 100% sample rate. http://gerrit.cloudera.org:8080/#/c/16792/2/testdata/workloads/functional-planner/queries/PlannerTest/optimize-simple-limit.test File testdata/workloads/functional-planner/queries/PlannerTest/optimize-simple-limit.test: http://gerrit.cloudera.org:8080/#/c/16792/2/testdata/workloads/functional-planner/queries/PlannerTest/optimize-simple-limit.test@275 PS2, Line 275: select * from functional.alltypes_dp_2_view_2 limit 10; > Yup..makes sense. I added a query at the end that is against the same base Done -- To view, visit http://gerrit.cloudera.org:8080/16792 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Ife05a5343c913006f7659949b327b63d3f10c04b Gerrit-Change-Number: 16792 Gerrit-PatchSet: 3 Gerrit-Owner: Aman Sinha <amsi...@cloudera.com> Gerrit-Reviewer: Aman Sinha <amsi...@cloudera.com> Gerrit-Reviewer: Impala Public Jenkins <impala-public-jenk...@cloudera.com> Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com> Gerrit-Reviewer: Shant Hovsepian <sh...@cloudera.com> Gerrit-Reviewer: Tim Armstrong <tarmstr...@cloudera.com> Gerrit-Comment-Date: Thu, 03 Dec 2020 15:47:38 +0000 Gerrit-HasComments: Yes