[
https://issues.apache.org/jira/browse/SPARK-57784?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
ASF GitHub Bot updated SPARK-57784:
-----------------------------------
Labels: pull-request-available (was: )
> Support the TIME data type in cost-based optimizer statistics estimation
> ------------------------------------------------------------------------
>
> Key: SPARK-57784
> URL: https://issues.apache.org/jira/browse/SPARK-57784
> Project: Spark
> Issue Type: Sub-task
> Components: SQL
> Affects Versions: 4.3.0
> Reporter: Max Gekk
> Priority: Major
> Labels: pull-request-available
>
> h2. Background
> Umbrella: SPARK-57550 (Extend support for the TIME data type).
> SPARK-54582 fixes catalog column-statistics (de)serialization for TIME
> ({{CatalogColumnStat.toExternalString}} / {{fromExternalString}}), so
> {{ANALYZE TABLE ...
> COMPUTE STATISTICS FOR COLUMNS}} can persist TIME min/max. This ticket covers
> the
> downstream consumption of those statistics in the cost-based optimizer (CBO).
> h2. Problem
> The CBO estimation code paths (e.g. {{FilterEstimation}}, {{JoinEstimation}},
> {{EstimationUtils}}) special-case ordered datetime types such as {{DateType}}
> and
> {{TimestampType}} when estimating selectivity for range predicates and joins.
> It is not
> verified that {{TimeType}} is handled equivalently. If TIME falls through,
> range
> predicates and joins on TIME columns silently use default selectivity instead
> of
> range-based estimates -- a plan-quality gap (not a correctness bug),
> inconsistent with
> DATE/TIMESTAMP parity.
> h2. Scope
> * Audit CBO estimation utilities for datetime special-casing and ensure
> {{TimeType}}
> (a {{Long}}-backed ordered/atomic type) is treated on par with {{DateType}}
> /
> {{TimestampType}} for:
> ** range/comparison predicate selectivity,
> ** join cardinality on TIME keys,
> ** histogram-based estimation, if applicable.
> * Add estimation unit tests for TIME mirroring the DATE/TIMESTAMP cases.
> h2. Acceptance
> * Range predicates and joins on TIME columns produce range-based row-count
> estimates
> comparable to DATE/TIMESTAMP, backed by tests.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]