Aman Sinha has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/23122 )

Change subject: IMPALA-14094: Calcite planner: Use table and column statistics 
for optimization
......................................................................


Patch Set 3:

(11 comments)

Sending some comments based on a first pass. Will continue reviewing.

http://gerrit.cloudera.org:8080/#/c/23122/3//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/23122/3//COMMIT_MSG@23
PS3, Line 23:     TableScan: Gets the row count from the Table object. returns 
null if no
In this current patch, In CalciteTable, getRowCount() is returning a 
Double.MAX_VALUE if stats are not present.  I don't see null being returned for 
the row count (the return type is primitive double which is ok).


http://gerrit.cloudera.org:8080/#/c/23122/3/java/calcite-planner/src/main/java/org/apache/impala/calcite/schema/CalciteTable.java
File 
java/calcite-planner/src/main/java/org/apache/impala/calcite/schema/CalciteTable.java:

http://gerrit.cloudera.org:8080/#/c/23122/3/java/calcite-planner/src/main/java/org/apache/impala/calcite/schema/CalciteTable.java@226
PS3, Line 226:     return table_.getNumRows() < 0 ? Double.MAX_VALUE : (double) 
table_.getNumRows();
For the uninitialized stats case, returning Double.MAX_VALUE could lead to 
overflows in cases such as UNION operator or a disjunctive predicate where we 
are adding row counts from 2 or more inputs.  Have you considered that ?  Why 
not just return the -1 and let the caller handle it as needed ?  Each caller 
may have different ways of dealing with uninitialized stats.


http://gerrit.cloudera.org:8080/#/c/23122/3/java/calcite-planner/src/main/java/org/apache/impala/calcite/schema/FilterSelectivityEstimator.java
File 
java/calcite-planner/src/main/java/org/apache/impala/calcite/schema/FilterSelectivityEstimator.java:

http://gerrit.cloudera.org:8080/#/c/23122/3/java/calcite-planner/src/main/java/org/apache/impala/calcite/schema/FilterSelectivityEstimator.java@130
PS3, Line 130:         return ((double) 1 / (double) 3);
nit: you could define a constant double value and return that instead of doing 
a division for every invocation.


http://gerrit.cloudera.org:8080/#/c/23122/3/java/calcite-planner/src/main/java/org/apache/impala/calcite/schema/FilterSelectivityEstimator.java@133
PS3, Line 133:         return ((double) 1 / (double) 9);
nit: same as above


http://gerrit.cloudera.org:8080/#/c/23122/3/java/calcite-planner/src/main/java/org/apache/impala/calcite/schema/FilterSelectivityEstimator.java@137
PS3, Line 137:         return computeEqualsSelectivity(call);
For default, I would think we should return 1.0.  Equals selectivity is quite 
specific.  Is this coming from any existing implementation in either Calcite or 
Impala ?   In Calcite, there's also a guessSelectivity() although that's mainly 
used when stats are not available: 
https://github.com/apache/calcite/blob/main/core/src/main/java/org/apache/calcite/rel/metadata/RelMdUtil.java#L474


http://gerrit.cloudera.org:8080/#/c/23122/3/java/calcite-planner/src/main/java/org/apache/impala/calcite/schema/FilterSelectivityEstimator.java@160
PS3, Line 160:    * @return
nit: add the description for the returned value


http://gerrit.cloudera.org:8080/#/c/23122/3/java/calcite-planner/src/main/java/org/apache/impala/calcite/schema/FilterSelectivityEstimator.java@198
PS3, Line 198: //XXX:
nit: remove or replace with TODO or some such.


http://gerrit.cloudera.org:8080/#/c/23122/3/java/calcite-planner/src/main/java/org/apache/impala/calcite/schema/FilterSelectivityEstimator.java@217
PS3, Line 217:    * @return
nit: add the description for return value (here and in other funtions)


http://gerrit.cloudera.org:8080/#/c/23122/3/java/calcite-planner/src/main/java/org/apache/impala/calcite/schema/ImpalaRelMdDistinctRowCount.java
File 
java/calcite-planner/src/main/java/org/apache/impala/calcite/schema/ImpalaRelMdDistinctRowCount.java:

http://gerrit.cloudera.org:8080/#/c/23122/3/java/calcite-planner/src/main/java/org/apache/impala/calcite/schema/ImpalaRelMdDistinctRowCount.java@57
PS3, Line 57:     // no stats available, return null
The comment says return null but you are returning the Double.MAX_VALUE.  It's 
odd to treat Double.MAX_VALUE as a proxy for absence of stats.  Is there a way 
to be consistent everywhere and use the  row count  < 0 as the accurate 
reflection of absence of stats ?  It would help simplify the code.


http://gerrit.cloudera.org:8080/#/c/23122/3/java/calcite-planner/src/main/java/org/apache/impala/calcite/schema/ImpalaRelMdRowCount.java
File 
java/calcite-planner/src/main/java/org/apache/impala/calcite/schema/ImpalaRelMdRowCount.java:

http://gerrit.cloudera.org:8080/#/c/23122/3/java/calcite-planner/src/main/java/org/apache/impala/calcite/schema/ImpalaRelMdRowCount.java@45
PS3, Line 45: RelMdCount
nit: RelMdRowCount


http://gerrit.cloudera.org:8080/#/c/23122/3/java/calcite-planner/src/main/java/org/apache/impala/calcite/schema/ImpalaRelMdRowCount.java@48
PS3, Line 48: public class ImpalaRelMdRowCount extends RelMdRowCount {
This class has overridden implementations for certain operators.  For others, 
in particular the Set operators (union, intersect, except), are we good with 
the default Calcite implementation ?



--
To view, visit http://gerrit.cloudera.org:8080/23122
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I9d5bb50eb562c28e4b7c7a6529d140f98e77295c
Gerrit-Change-Number: 23122
Gerrit-PatchSet: 3
Gerrit-Owner: Steve Carlin <[email protected]>
Gerrit-Reviewer: Aman Sinha <[email protected]>
Gerrit-Reviewer: Anonymous Coward (816)
Gerrit-Reviewer: Fang-Yu Rao <[email protected]>
Gerrit-Reviewer: Impala Public Jenkins <[email protected]>
Gerrit-Reviewer: Joe McDonnell <[email protected]>
Gerrit-Reviewer: Michael Smith <[email protected]>
Gerrit-Reviewer: Riza Suminto <[email protected]>
Gerrit-Comment-Date: Sun, 13 Jul 2025 18:10:09 +0000
Gerrit-HasComments: Yes

Reply via email to