Steve Carlin has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/23122 )

Change subject: IMPALA-14094: Calcite planner: Use table and column statistics 
for optimization
......................................................................


Patch Set 3:

(12 comments)

Made a bit of a change with this commit...

I originally thought that "no stats" meant no stats and we couldn't do anything 
with it.  It seems I missed the code that exists in HdfsScanNode that estimates 
row count stats if the stats are not computed which is done at physical node 
compilatino time rather than when the stats are retrieved.

Because Calcite needs this when doing its compilation, these estimations need 
to be done earlier than the physical compilation stage. There is a prependum to 
this commit now which refactors the estimation code so that it is shareable.

http://gerrit.cloudera.org:8080/#/c/23122/3//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/23122/3//COMMIT_MSG@23
PS3, Line 23:     TableScan: Gets the row count from the Table object. returns 
null if no
> In this current patch, In CalciteTable, getRowCount() is returning a Double
See general comment, this has changed a bit.


http://gerrit.cloudera.org:8080/#/c/23122/3/java/calcite-planner/src/main/java/org/apache/impala/calcite/schema/CalciteTable.java
File 
java/calcite-planner/src/main/java/org/apache/impala/calcite/schema/CalciteTable.java:

http://gerrit.cloudera.org:8080/#/c/23122/3/java/calcite-planner/src/main/java/org/apache/impala/calcite/schema/CalciteTable.java@226
PS3, Line 226:     return table_.getNumRows() < 0 ? Double.MAX_VALUE : (double) 
table_.getNumRows();
> For the uninitialized stats case, returning Double.MAX_VALUE could lead to
See general comment, this has changed a bit.


http://gerrit.cloudera.org:8080/#/c/23122/3/java/calcite-planner/src/main/java/org/apache/impala/calcite/schema/FilterSelectivityEstimator.java
File 
java/calcite-planner/src/main/java/org/apache/impala/calcite/schema/FilterSelectivityEstimator.java:

http://gerrit.cloudera.org:8080/#/c/23122/3/java/calcite-planner/src/main/java/org/apache/impala/calcite/schema/FilterSelectivityEstimator.java@130
PS3, Line 130:         return ((double) 1 / (double) 3);
> nit: you could define a constant double value and return that instead of do
Done


http://gerrit.cloudera.org:8080/#/c/23122/3/java/calcite-planner/src/main/java/org/apache/impala/calcite/schema/FilterSelectivityEstimator.java@133
PS3, Line 133:         return ((double) 1 / (double) 9);
> nit: same as above
Done


http://gerrit.cloudera.org:8080/#/c/23122/3/java/calcite-planner/src/main/java/org/apache/impala/calcite/schema/FilterSelectivityEstimator.java@137
PS3, Line 137:         return computeEqualsSelectivity(call);
> For default, I would think we should return 1.0.  Equals selectivity is qui
The logic is similar to what Hive was using: 
https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/stats/FilterSelectivityEstimator.java#L181

The function name is different though, since it isn't called "equals" there.

I don't really have a preference on what to do here so I'll wait for your reply 
here. If you think we should change to 1.0, I can go along with that idea.  Or 
perhaps file a Jira to investigate this more.


http://gerrit.cloudera.org:8080/#/c/23122/3/java/calcite-planner/src/main/java/org/apache/impala/calcite/schema/FilterSelectivityEstimator.java@198
PS3, Line 198: //XXX:
> nit: remove or replace with TODO or some such.
Done


http://gerrit.cloudera.org:8080/#/c/23122/3/java/calcite-planner/src/main/java/org/apache/impala/calcite/schema/FilterSelectivityEstimator.java@217
PS3, Line 217:    * @return
> nit: add the description for return value (here and in other funtions)
Done


http://gerrit.cloudera.org:8080/#/c/23122/3/java/calcite-planner/src/main/java/org/apache/impala/calcite/schema/ImpalaRelMdDistinctRowCount.java
File 
java/calcite-planner/src/main/java/org/apache/impala/calcite/schema/ImpalaRelMdDistinctRowCount.java:

http://gerrit.cloudera.org:8080/#/c/23122/3/java/calcite-planner/src/main/java/org/apache/impala/calcite/schema/ImpalaRelMdDistinctRowCount.java@57
PS3, Line 57:     // no stats available, return null
> The comment says return null but you are returning the Double.MAX_VALUE.  I
See general comment, this is no longer applicable.


http://gerrit.cloudera.org:8080/#/c/23122/3/java/calcite-planner/src/main/java/org/apache/impala/calcite/schema/ImpalaRelMdRowCount.java
File 
java/calcite-planner/src/main/java/org/apache/impala/calcite/schema/ImpalaRelMdRowCount.java:

http://gerrit.cloudera.org:8080/#/c/23122/3/java/calcite-planner/src/main/java/org/apache/impala/calcite/schema/ImpalaRelMdRowCount.java@45
PS3, Line 45: RelMdCount
> nit: RelMdRowCount
Done


http://gerrit.cloudera.org:8080/#/c/23122/3/java/calcite-planner/src/main/java/org/apache/impala/calcite/schema/ImpalaRelMdRowCount.java@48
PS3, Line 48: public class ImpalaRelMdRowCount extends RelMdRowCount {
> This class has overridden implementations for certain operators.  For other
For now, I would say yes.  This is something that may require further 
investigation at some point.  I removed the Aggregate one, so the only ones 
overridden right now are:

TableScan (we need to get our specific row counts
Filter: We do our own filter estimations based on our stats
Join: We have our own join calculations, the Calcite default ones didn't do 
well for us.

We may find at some point there are some Agg ones that can be improved upon, 
but I'm willing to stick with the default for now until more investigation is 
done.


http://gerrit.cloudera.org:8080/#/c/23122/3/java/calcite-planner/src/test/java/org/apache/impala/planner/TestCalciteStats.java
File 
java/calcite-planner/src/test/java/org/apache/impala/planner/TestCalciteStats.java:

http://gerrit.cloudera.org:8080/#/c/23122/3/java/calcite-planner/src/test/java/org/apache/impala/planner/TestCalciteStats.java@66
PS3, Line 66: import static org.junit.Assert.*;
> nit:  import statement with wildcard is not recommended for Impala (I thoug
Done


http://gerrit.cloudera.org:8080/#/c/23122/3/testdata/workloads/functional-planner/queries/PlannerTest/calcite_tpcds/tpcds-q63.test
File 
testdata/workloads/functional-planner/queries/PlannerTest/calcite_tpcds/tpcds-q63.test:

http://gerrit.cloudera.org:8080/#/c/23122/3/testdata/workloads/functional-planner/queries/PlannerTest/calcite_tpcds/tpcds-q63.test@260
PS3, Line 260: cardinality=327.58K
> We used to also display in parenthesis "(filtered from <cardinality>)"  ..
There are some places where it is still displayed (see line 141).  I think this 
is ok?  Perhaps it just didn't produce any savings by the runtime filter here.



--
To view, visit http://gerrit.cloudera.org:8080/23122
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I9d5bb50eb562c28e4b7c7a6529d140f98e77295c
Gerrit-Change-Number: 23122
Gerrit-PatchSet: 3
Gerrit-Owner: Steve Carlin <[email protected]>
Gerrit-Reviewer: Aman Sinha <[email protected]>
Gerrit-Reviewer: Anonymous Coward (816)
Gerrit-Reviewer: Fang-Yu Rao <[email protected]>
Gerrit-Reviewer: Impala Public Jenkins <[email protected]>
Gerrit-Reviewer: Joe McDonnell <[email protected]>
Gerrit-Reviewer: Michael Smith <[email protected]>
Gerrit-Reviewer: Riza Suminto <[email protected]>
Gerrit-Reviewer: Steve Carlin <[email protected]>
Gerrit-Comment-Date: Thu, 17 Jul 2025 20:09:09 +0000
Gerrit-HasComments: Yes

Reply via email to