Aman Sinha has posted comments on this change. ( http://gerrit.cloudera.org:8080/23122 )
Change subject: IMPALA-14094: Calcite planner: Use table and column statistics for optimization ...................................................................... Patch Set 3: (11 comments) Sending some comments based on a first pass. Will continue reviewing. http://gerrit.cloudera.org:8080/#/c/23122/3//COMMIT_MSG Commit Message: http://gerrit.cloudera.org:8080/#/c/23122/3//COMMIT_MSG@23 PS3, Line 23: TableScan: Gets the row count from the Table object. returns null if no In this current patch, In CalciteTable, getRowCount() is returning a Double.MAX_VALUE if stats are not present. I don't see null being returned for the row count (the return type is primitive double which is ok). http://gerrit.cloudera.org:8080/#/c/23122/3/java/calcite-planner/src/main/java/org/apache/impala/calcite/schema/CalciteTable.java File java/calcite-planner/src/main/java/org/apache/impala/calcite/schema/CalciteTable.java: http://gerrit.cloudera.org:8080/#/c/23122/3/java/calcite-planner/src/main/java/org/apache/impala/calcite/schema/CalciteTable.java@226 PS3, Line 226: return table_.getNumRows() < 0 ? Double.MAX_VALUE : (double) table_.getNumRows(); For the uninitialized stats case, returning Double.MAX_VALUE could lead to overflows in cases such as UNION operator or a disjunctive predicate where we are adding row counts from 2 or more inputs. Have you considered that ? Why not just return the -1 and let the caller handle it as needed ? Each caller may have different ways of dealing with uninitialized stats. http://gerrit.cloudera.org:8080/#/c/23122/3/java/calcite-planner/src/main/java/org/apache/impala/calcite/schema/FilterSelectivityEstimator.java File java/calcite-planner/src/main/java/org/apache/impala/calcite/schema/FilterSelectivityEstimator.java: http://gerrit.cloudera.org:8080/#/c/23122/3/java/calcite-planner/src/main/java/org/apache/impala/calcite/schema/FilterSelectivityEstimator.java@130 PS3, Line 130: return ((double) 1 / (double) 3); nit: you could define a constant double value and return that instead of doing a division for every invocation. http://gerrit.cloudera.org:8080/#/c/23122/3/java/calcite-planner/src/main/java/org/apache/impala/calcite/schema/FilterSelectivityEstimator.java@133 PS3, Line 133: return ((double) 1 / (double) 9); nit: same as above http://gerrit.cloudera.org:8080/#/c/23122/3/java/calcite-planner/src/main/java/org/apache/impala/calcite/schema/FilterSelectivityEstimator.java@137 PS3, Line 137: return computeEqualsSelectivity(call); For default, I would think we should return 1.0. Equals selectivity is quite specific. Is this coming from any existing implementation in either Calcite or Impala ? In Calcite, there's also a guessSelectivity() although that's mainly used when stats are not available: https://github.com/apache/calcite/blob/main/core/src/main/java/org/apache/calcite/rel/metadata/RelMdUtil.java#L474 http://gerrit.cloudera.org:8080/#/c/23122/3/java/calcite-planner/src/main/java/org/apache/impala/calcite/schema/FilterSelectivityEstimator.java@160 PS3, Line 160: * @return nit: add the description for the returned value http://gerrit.cloudera.org:8080/#/c/23122/3/java/calcite-planner/src/main/java/org/apache/impala/calcite/schema/FilterSelectivityEstimator.java@198 PS3, Line 198: //XXX: nit: remove or replace with TODO or some such. http://gerrit.cloudera.org:8080/#/c/23122/3/java/calcite-planner/src/main/java/org/apache/impala/calcite/schema/FilterSelectivityEstimator.java@217 PS3, Line 217: * @return nit: add the description for return value (here and in other funtions) http://gerrit.cloudera.org:8080/#/c/23122/3/java/calcite-planner/src/main/java/org/apache/impala/calcite/schema/ImpalaRelMdDistinctRowCount.java File java/calcite-planner/src/main/java/org/apache/impala/calcite/schema/ImpalaRelMdDistinctRowCount.java: http://gerrit.cloudera.org:8080/#/c/23122/3/java/calcite-planner/src/main/java/org/apache/impala/calcite/schema/ImpalaRelMdDistinctRowCount.java@57 PS3, Line 57: // no stats available, return null The comment says return null but you are returning the Double.MAX_VALUE. It's odd to treat Double.MAX_VALUE as a proxy for absence of stats. Is there a way to be consistent everywhere and use the row count < 0 as the accurate reflection of absence of stats ? It would help simplify the code. http://gerrit.cloudera.org:8080/#/c/23122/3/java/calcite-planner/src/main/java/org/apache/impala/calcite/schema/ImpalaRelMdRowCount.java File java/calcite-planner/src/main/java/org/apache/impala/calcite/schema/ImpalaRelMdRowCount.java: http://gerrit.cloudera.org:8080/#/c/23122/3/java/calcite-planner/src/main/java/org/apache/impala/calcite/schema/ImpalaRelMdRowCount.java@45 PS3, Line 45: RelMdCount nit: RelMdRowCount http://gerrit.cloudera.org:8080/#/c/23122/3/java/calcite-planner/src/main/java/org/apache/impala/calcite/schema/ImpalaRelMdRowCount.java@48 PS3, Line 48: public class ImpalaRelMdRowCount extends RelMdRowCount { This class has overridden implementations for certain operators. For others, in particular the Set operators (union, intersect, except), are we good with the default Calcite implementation ? -- To view, visit http://gerrit.cloudera.org:8080/23122 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I9d5bb50eb562c28e4b7c7a6529d140f98e77295c Gerrit-Change-Number: 23122 Gerrit-PatchSet: 3 Gerrit-Owner: Steve Carlin <[email protected]> Gerrit-Reviewer: Aman Sinha <[email protected]> Gerrit-Reviewer: Anonymous Coward (816) Gerrit-Reviewer: Fang-Yu Rao <[email protected]> Gerrit-Reviewer: Impala Public Jenkins <[email protected]> Gerrit-Reviewer: Joe McDonnell <[email protected]> Gerrit-Reviewer: Michael Smith <[email protected]> Gerrit-Reviewer: Riza Suminto <[email protected]> Gerrit-Comment-Date: Sun, 13 Jul 2025 18:10:09 +0000 Gerrit-HasComments: Yes
