Biao Geng created FLINK-37315:
---------------------------------
Summary: Improve SelectivityEstimator#estimateEquals for columns
in VARCHAR type
Key: FLINK-37315
URL: https://issues.apache.org/jira/browse/FLINK-37315
Project: Flink
Issue Type: Improvement
Components: Table SQL / Planner
Reporter: Biao Geng
Currently in the planner, we utilize *columnInterval* to estimate the percentage
of rows that satisfy an equality (=) expression (see the
[link|https://github.com/apache/flink/blob/05a3e9c578e9efe8755058d1f7f1b8e71c456645/flink-table/flink-table-planner/src/main/scala/org/apache/flink/table/planner/plan/metadata/SelectivityEstimator.scala#L546]
for details).
While this approach is effective for numeric columns (e.g., {{INT}},
{{BIGINT}}), it shows limitations for {{VARCHAR}} columns. Since
character-based columns lack a native concept of an "interval", the existing
implementation falls back to returning {{defaultEqualsSelectivity}} for these
cases.
However, we can still use the ndv (number of distinct values) statistic to get a
better estimate for {{VARCHAR}} columns in equality checks (i.e., using 1 / ndv).
This is also what Spark does in its
[planner|https://github.com/apache/spark/blob/0a102327d52712f3947cb90da671ffb18be62626/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/FilterEstimation.scala#L338].
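The proposed estimate can be sketched as follows. This is a minimal illustration, not Flink's actual {{SelectivityEstimator}} API; the class name, the fallback constant's value, and the {{estimateEquals}} signature are all hypothetical.

```java
// Hypothetical sketch: equality selectivity from ndv, with a fallback when
// no statistics are available. Names and the fallback value are illustrative.
public class EqualsSelectivitySketch {

    // Assumed fallback, analogous to defaultEqualsSelectivity in the planner.
    static final double DEFAULT_EQUALS_SELECTIVITY = 0.15;

    /**
     * Estimates the fraction of rows matching `col = literal`.
     *
     * @param ndv number of distinct values of the column, or null if unknown
     * @return estimated selectivity in (0, 1]
     */
    static double estimateEquals(Double ndv) {
        if (ndv == null || ndv <= 0) {
            // No usable statistics: fall back to the default, as the current
            // implementation does for VARCHAR columns.
            return DEFAULT_EQUALS_SELECTIVITY;
        }
        // Assuming a uniform distribution over distinct values, each value
        // matches roughly 1 / ndv of the rows; clamp to 1.0 for safety.
        return Math.min(1.0, 1.0 / ndv);
    }

    public static void main(String[] args) {
        System.out.println(estimateEquals(50.0)); // 1 / 50 = 0.02
        System.out.println(estimateEquals(null)); // falls back to 0.15
    }
}
```

The point of the change is the middle branch: when ndv is known for a {{VARCHAR}} column, 1 / ndv is usually a much tighter bound than a fixed default.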
We verified this change on a 10TB TPC-DS benchmark dataset and observed the
following improvements:
 * *30% reduction* in execution time for queries #13, #48, #85
 * *10% reduction* for queries #67, #95
--
This message was sent by Atlassian Jira
(v8.20.10#820010)