Biao Geng created FLINK-37315:
---------------------------------
Summary: Improve SelectivityEstimator#estimateEquals for columns
in VARCHAR type
Key: FLINK-37315
URL: https://issues.apache.org/jira/browse/FLINK-37315
Project: Flink
Issue Type: Improvement
Components: Table SQL / Planner
Reporter: Biao Geng
Currently in the planner, we utilize *columnInterval* to estimate the percentage
of rows that satisfy an equality (=) expression (see the
[link|https://github.com/apache/flink/blob/05a3e9c578e9efe8755058d1f7f1b8e71c456645/flink-table/flink-table-planner/src/main/scala/org/apache/flink/table/planner/plan/metadata/SelectivityEstimator.scala#L546]
for details).
While this approach is effective for numeric columns (e.g., {{INT}},
{{BIGINT}}), it shows limitations for {{VARCHAR}} columns. Since
character-based columns lack a native concept of an "interval", the existing
implementation falls back to returning {{defaultEqualsSelectivity}} for these
cases.
However, we can still use the ndv (number of distinct values) statistic to get a
better estimate for {{VARCHAR}} columns in equality checks (i.e., using 1 / ndv).
This is also what Spark does in its
[planner|https://github.com/apache/spark/blob/0a102327d52712f3947cb90da671ffb18be62626/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/FilterEstimation.scala#L338].
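The proposed estimate can be sketched as follows. This is a minimal illustration, not Flink's actual {{SelectivityEstimator}} API; the class name, the fallback constant's value, and the {{estimateEquals}} signature are all hypothetical.

```java
// Hypothetical sketch: equality selectivity from ndv, with a fallback when
// no statistics are available. Names and the fallback value are illustrative.
public class EqualsSelectivitySketch {

    // Assumed fallback, analogous to defaultEqualsSelectivity in the planner.
    static final double DEFAULT_EQUALS_SELECTIVITY = 0.15;

    /**
     * Estimates the fraction of rows matching `col = literal`.
     *
     * @param ndv number of distinct values of the column, or null if unknown
     * @return estimated selectivity in (0, 1]
     */
    static double estimateEquals(Double ndv) {
        if (ndv == null || ndv <= 0) {
            // No usable statistics: fall back to the default, as the current
            // implementation does for VARCHAR columns.
            return DEFAULT_EQUALS_SELECTIVITY;
        }
        // Assuming a uniform distribution over distinct values, each value
        // matches roughly 1 / ndv of the rows; clamp to 1.0 for safety.
        return Math.min(1.0, 1.0 / ndv);
    }

    public static void main(String[] args) {
        System.out.println(estimateEquals(50.0)); // 1 / 50 = 0.02
        System.out.println(estimateEquals(null)); // falls back to 0.15
    }
}
```

The point of the change is the middle branch: when ndv is known for a {{VARCHAR}} column, 1 / ndv is usually a much tighter bound than a fixed default.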
We verified this change on a 10TB TPC-DS benchmark dataset and observed the
following improvements:
 * *30% reduction* in execution time for queries #13, #48, #85
 * *10% reduction* for queries #67, #95
--
This message was sent by Atlassian Jira
(v8.20.10#820010)