[ 
https://issues.apache.org/jira/browse/FLINK-37315?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated FLINK-37315:
-----------------------------------
    Labels: pull-request-available  (was: )

> Improve SelectivityEstimator#estimateEquals for columns in VARCHAR type
> -----------------------------------------------------------------------
>
>                 Key: FLINK-37315
>                 URL: https://issues.apache.org/jira/browse/FLINK-37315
>             Project: Flink
>          Issue Type: Improvement
>          Components: Table SQL / Planner
>            Reporter: Biao Geng
>            Priority: Minor
>              Labels: pull-request-available
>
> Currently in the planner, we utilize *columnInterval* to estimate the 
> percentage of rows meeting an equality (=) expression (see the 
> [link|https://github.com/apache/flink/blob/05a3e9c578e9efe8755058d1f7f1b8e71c456645/flink-table/flink-table-planner/src/main/scala/org/apache/flink/table/planner/plan/metadata/SelectivityEstimator.scala#L546]
>  for more details).
> While this approach is effective for numeric columns (e.g., {{INT}}, 
> {{BIGINT}}), it shows limitations for {{VARCHAR}} columns. Since 
> character-based columns lack a native concept of an "interval", the existing 
> implementation falls back to returning {{defaultEqualsSelectivity}} in these 
> cases.
> However, we can still use the ndv (number of distinct values) to get a better 
> estimate for {{VARCHAR}} columns (i.e., using 1 / ndv) for equality checks. This 
> is also what Spark does in its 
> [planner|https://github.com/apache/spark/blob/0a102327d52712f3947cb90da671ffb18be62626/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/FilterEstimation.scala#L338].
> We verified this change on a 10TB TPC-DS benchmark dataset and observed 
> the following improvements:
>  * *30% reduction* in execution time for queries #13, #48, #85
>  * *10% reduction* for queries #67, #95
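The 1/ndv heuristic described above can be sketched as follows. This is a minimal illustration, not the actual planner code: the class, method, and the constant value are hypothetical, and the real implementation lives in SelectivityEstimator.scala.

```java
import java.util.Optional;

public class EqualsSelectivitySketch {

    // Hypothetical fallback mirroring the planner's defaultEqualsSelectivity;
    // the actual default in Flink may differ.
    static final double DEFAULT_EQUALS_SELECTIVITY = 0.15;

    // Estimate the fraction of rows matching `col = literal`.
    // When the column's ndv (number of distinct values) is known, assume a
    // uniform distribution over distinct values and return 1 / ndv; this works
    // for VARCHAR columns even though they have no numeric interval.
    // Otherwise, fall back to the default selectivity.
    static double estimateEquals(Optional<Long> ndv) {
        return ndv.filter(n -> n > 0)
                  .map(n -> 1.0 / n)
                  .orElse(DEFAULT_EQUALS_SELECTIVITY);
    }

    public static void main(String[] args) {
        // VARCHAR column with 50 distinct values: 1/50 = 2% selectivity.
        System.out.println(estimateEquals(Optional.of(50L)));
        // No statistics available: fall back to the default.
        System.out.println(estimateEquals(Optional.empty()));
    }
}
```

With ndv statistics available (e.g., from ANALYZE TABLE), this gives the optimizer a much tighter estimate than a fixed default, which is what drives the join-order and plan improvements reported below.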



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
