Steve Carlin has posted comments on this change. ( http://gerrit.cloudera.org:8080/23122 )
Change subject: IMPALA-14094: Calcite planner: Use table and column statistics for optimization ...................................................................... Patch Set 3: (12 comments) Made a bit of a change with this commit... I originally thought that "no stats" meant no stats and we couldn't do anything with it. It seems I missed the code that exists in HdfsScanNode that estimates row count stats if the stats are not computed which is done at physical node compilatino time rather than when the stats are retrieved. Because Calcite needs this when doing its compilation, these estimations need to be done earlier than the physical compilation stage. There is a prependum to this commit now which refactors the estimation code so that it is shareable. http://gerrit.cloudera.org:8080/#/c/23122/3//COMMIT_MSG Commit Message: http://gerrit.cloudera.org:8080/#/c/23122/3//COMMIT_MSG@23 PS3, Line 23: TableScan: Gets the row count from the Table object. returns null if no > In this current patch, In CalciteTable, getRowCount() is returning a Double See general comment, this has changed a bit. http://gerrit.cloudera.org:8080/#/c/23122/3/java/calcite-planner/src/main/java/org/apache/impala/calcite/schema/CalciteTable.java File java/calcite-planner/src/main/java/org/apache/impala/calcite/schema/CalciteTable.java: http://gerrit.cloudera.org:8080/#/c/23122/3/java/calcite-planner/src/main/java/org/apache/impala/calcite/schema/CalciteTable.java@226 PS3, Line 226: return table_.getNumRows() < 0 ? Double.MAX_VALUE : (double) table_.getNumRows(); > For the uninitialized stats case, returning Double.MAX_VALUE could lead to See general comment, this has changed a bit. http://gerrit.cloudera.org:8080/#/c/23122/3/java/calcite-planner/src/main/java/org/apache/impala/calcite/schema/FilterSelectivityEstimator.java File java/calcite-planner/src/main/java/org/apache/impala/calcite/schema/FilterSelectivityEstimator.java: http://gerrit.cloudera.org:8080/#/c/23122/3/java/calcite-planner/src/main/java/org/apache/impala/calcite/schema/FilterSelectivityEstimator.java@130 PS3, Line 130: return ((double) 1 / (double) 3); > nit: you could define a constant double value and return that instead of do Done http://gerrit.cloudera.org:8080/#/c/23122/3/java/calcite-planner/src/main/java/org/apache/impala/calcite/schema/FilterSelectivityEstimator.java@133 PS3, Line 133: return ((double) 1 / (double) 9); > nit: same as above Done http://gerrit.cloudera.org:8080/#/c/23122/3/java/calcite-planner/src/main/java/org/apache/impala/calcite/schema/FilterSelectivityEstimator.java@137 PS3, Line 137: return computeEqualsSelectivity(call); > For default, I would think we should return 1.0. Equals selectivity is qui The logic is similar to what Hive was using: https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/stats/FilterSelectivityEstimator.java#L181 The function name is different though, since it isn't called "equals" there. I don't really have a preference on what to do here so I'll wait for your reply here. If you think we should change to 1.0, I can go along with that idea. Or perhaps file a Jira to investigate this more. http://gerrit.cloudera.org:8080/#/c/23122/3/java/calcite-planner/src/main/java/org/apache/impala/calcite/schema/FilterSelectivityEstimator.java@198 PS3, Line 198: //XXX: > nit: remove or replace with TODO or some such. Done http://gerrit.cloudera.org:8080/#/c/23122/3/java/calcite-planner/src/main/java/org/apache/impala/calcite/schema/FilterSelectivityEstimator.java@217 PS3, Line 217: * @return > nit: add the description for return value (here and in other funtions) Done http://gerrit.cloudera.org:8080/#/c/23122/3/java/calcite-planner/src/main/java/org/apache/impala/calcite/schema/ImpalaRelMdDistinctRowCount.java File java/calcite-planner/src/main/java/org/apache/impala/calcite/schema/ImpalaRelMdDistinctRowCount.java: http://gerrit.cloudera.org:8080/#/c/23122/3/java/calcite-planner/src/main/java/org/apache/impala/calcite/schema/ImpalaRelMdDistinctRowCount.java@57 PS3, Line 57: // no stats available, return null > The comment says return null but you are returning the Double.MAX_VALUE. I See general comment, this is no longer applicable. http://gerrit.cloudera.org:8080/#/c/23122/3/java/calcite-planner/src/main/java/org/apache/impala/calcite/schema/ImpalaRelMdRowCount.java File java/calcite-planner/src/main/java/org/apache/impala/calcite/schema/ImpalaRelMdRowCount.java: http://gerrit.cloudera.org:8080/#/c/23122/3/java/calcite-planner/src/main/java/org/apache/impala/calcite/schema/ImpalaRelMdRowCount.java@45 PS3, Line 45: RelMdCount > nit: RelMdRowCount Done http://gerrit.cloudera.org:8080/#/c/23122/3/java/calcite-planner/src/main/java/org/apache/impala/calcite/schema/ImpalaRelMdRowCount.java@48 PS3, Line 48: public class ImpalaRelMdRowCount extends RelMdRowCount { > This class has overridden implementations for certain operators. For other For now, I would say yes. This is something that may require further investigation at some point. I removed the Aggregate one, so the only ones overridden right now are: TableScan (we need to get our specific row counts Filter: We do our own filter estimations based on our stats Join: We have our own join calculations, the Calcite default ones didn't do well for us. We may find at some point there are some Agg ones that can be improved upon, but I'm willing to stick with the default for now until more investigation is done. http://gerrit.cloudera.org:8080/#/c/23122/3/java/calcite-planner/src/test/java/org/apache/impala/planner/TestCalciteStats.java File java/calcite-planner/src/test/java/org/apache/impala/planner/TestCalciteStats.java: http://gerrit.cloudera.org:8080/#/c/23122/3/java/calcite-planner/src/test/java/org/apache/impala/planner/TestCalciteStats.java@66 PS3, Line 66: import static org.junit.Assert.*; > nit: import statement with wildcard is not recommended for Impala (I thoug Done http://gerrit.cloudera.org:8080/#/c/23122/3/testdata/workloads/functional-planner/queries/PlannerTest/calcite_tpcds/tpcds-q63.test File testdata/workloads/functional-planner/queries/PlannerTest/calcite_tpcds/tpcds-q63.test: http://gerrit.cloudera.org:8080/#/c/23122/3/testdata/workloads/functional-planner/queries/PlannerTest/calcite_tpcds/tpcds-q63.test@260 PS3, Line 260: cardinality=327.58K > We used to also display in parenthesis "(filtered from <cardinality>)" .. There are some places where it is still displayed (see line 141). I think this is ok? Perhaps it just didn't produce any savings by the runtime filter here. -- To view, visit http://gerrit.cloudera.org:8080/23122 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I9d5bb50eb562c28e4b7c7a6529d140f98e77295c Gerrit-Change-Number: 23122 Gerrit-PatchSet: 3 Gerrit-Owner: Steve Carlin <[email protected]> Gerrit-Reviewer: Aman Sinha <[email protected]> Gerrit-Reviewer: Anonymous Coward (816) Gerrit-Reviewer: Fang-Yu Rao <[email protected]> Gerrit-Reviewer: Impala Public Jenkins <[email protected]> Gerrit-Reviewer: Joe McDonnell <[email protected]> Gerrit-Reviewer: Michael Smith <[email protected]> Gerrit-Reviewer: Riza Suminto <[email protected]> Gerrit-Reviewer: Steve Carlin <[email protected]> Gerrit-Comment-Date: Thu, 17 Jul 2025 20:09:09 +0000 Gerrit-HasComments: Yes
