[
https://issues.apache.org/jira/browse/FLINK-37315?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
ASF GitHub Bot updated FLINK-37315:
-----------------------------------
Labels: pull-request-available (was: )
> Improve SelectivityEstimator#estimateEquals for columns in VARCHAR type
> -----------------------------------------------------------------------
>
> Key: FLINK-37315
> URL: https://issues.apache.org/jira/browse/FLINK-37315
> Project: Flink
> Issue Type: Improvement
> Components: Table SQL / Planner
> Reporter: Biao Geng
> Priority: Minor
> Labels: pull-request-available
>
> Currently in the planner, we utilize *columnInterval* to estimate the
> percentage of rows matching an equality (=) predicate (see the
> [link|https://github.com/apache/flink/blob/05a3e9c578e9efe8755058d1f7f1b8e71c456645/flink-table/flink-table-planner/src/main/scala/org/apache/flink/table/planner/plan/metadata/SelectivityEstimator.scala#L546]
> for more details).
> While this approach is effective for numeric columns (e.g., {{{}INT{}}},
> {{{}BIGINT{}}}), it has limitations for {{VARCHAR}} columns. Since
> character-based columns lack a native concept of an "interval", the existing
> implementation falls back to returning {{defaultEqualsSelectivity}} for these
> cases.
> However, we can still use the ndv (number of distinct values) to make a
> better estimate for {{VARCHAR}} columns (i.e., using 1 / ndv) for equality
> checks. This is also what Spark does in its
> [planner|https://github.com/apache/spark/blob/0a102327d52712f3947cb90da671ffb18be62626/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/FilterEstimation.scala#L338].
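> The ndv-based fallback can be sketched as follows. This is a minimal,
> hypothetical illustration (not Flink's actual {{SelectivityEstimator}} API;
> the object name, method signature, and the 0.15 default are assumptions for
> the example only):

```scala
// Hypothetical sketch of ndv-based equality selectivity estimation.
// When a column has no usable interval (e.g. VARCHAR), but the number of
// distinct values (ndv) is known, assume a uniform distribution and
// estimate that an equality predicate selects 1 / ndv of the rows.
object EqualsSelectivitySketch {
  // Assumed fallback value, for illustration only.
  val DefaultEqualsSelectivity: Double = 0.15

  // Returns 1 / ndv when ndv is known and positive, otherwise the default.
  def estimateEquals(ndv: Option[Long]): Double = ndv match {
    case Some(n) if n > 0 => 1.0 / n
    case _                => DefaultEqualsSelectivity
  }
}
```

> Under this scheme a VARCHAR column with 100 distinct values yields a
> selectivity of 0.01 instead of the blanket default, which is what drives
> the improved plans below.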
> We verified this change using a 10TB TPC-DS benchmark dataset and observed
> the following improvements:
> * *30% reduction* in execution time for queries #13, #48, #85
> * *10% reduction* for queries #67, #95
--
This message was sent by Atlassian Jira
(v8.20.10#820010)