[ 
https://issues.apache.org/jira/browse/SPARK-57784?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-57784:
-----------------------------------
    Labels: pull-request-available  (was: )

> Support the TIME data type in cost-based optimizer statistics estimation
> ------------------------------------------------------------------------
>
>                 Key: SPARK-57784
>                 URL: https://issues.apache.org/jira/browse/SPARK-57784
>             Project: Spark
>          Issue Type: Sub-task
>          Components: SQL
>    Affects Versions: 4.3.0
>            Reporter: Max Gekk
>            Priority: Major
>              Labels: pull-request-available
>
> h2. Background
> Umbrella: SPARK-57550 (Extend support for the TIME data type).
> SPARK-54582 fixes catalog column-statistics (de)serialization for TIME
> ({{CatalogColumnStat.toExternalString}} / {{fromExternalString}}), so
> {{ANALYZE TABLE ... COMPUTE STATISTICS FOR COLUMNS}} can persist TIME min/max.
> This ticket covers the downstream consumption of those statistics in the
> cost-based optimizer (CBO).
> h2. Problem
> The CBO estimation code paths (e.g. {{FilterEstimation}}, {{JoinEstimation}},
> {{EstimationUtils}}) special-case ordered datetime types such as {{DateType}}
> and {{TimestampType}} when estimating selectivity for range predicates and
> joins. It has not been verified that {{TimeType}} is handled equivalently. If
> TIME falls through, range predicates and joins on TIME columns silently use
> default selectivity instead of range-based estimates. That is a plan-quality
> gap (not a correctness bug) and breaks parity with DATE/TIMESTAMP.
> h2. Scope
> * Audit CBO estimation utilities for datetime special-casing and ensure
>   {{TimeType}} (a {{Long}}-backed ordered/atomic type) is treated on par with
>   {{DateType}} / {{TimestampType}} for:
> ** range/comparison predicate selectivity,
> ** join cardinality on TIME keys,
> ** histogram-based estimation, if applicable.
> * Add estimation unit tests for TIME mirroring the DATE/TIMESTAMP cases.
> h2. Acceptance
> * Range predicates and joins on TIME columns produce range-based row-count
>   estimates comparable to DATE/TIMESTAMP, backed by tests.
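
To make the difference between default and range-based selectivity concrete, here is a minimal sketch of how a range estimate for a predicate like `col < literal` could be derived from a column's min/max statistics. It is not Spark's implementation. The function names, the fallback value of 1.0, the uniform-distribution assumption, and the nanoseconds-since-midnight encoding of the Long value are all assumptions made for illustration.

```python
from datetime import time
from typing import Optional

NANOS_PER_SECOND = 1_000_000_000


def time_to_nanos(t: time) -> int:
    """Map a time of day to nanoseconds since midnight (assumed Long encoding)."""
    seconds = t.hour * 3600 + t.minute * 60 + t.second
    return seconds * NANOS_PER_SECOND + t.microsecond * 1_000


def less_than_selectivity(
    literal: time,
    col_min: Optional[time],
    col_max: Optional[time],
    default: float = 1.0,  # hypothetical fallback when no statistics are usable
) -> float:
    """Estimate the fraction of rows satisfying `col < literal`.

    Assumes values are uniformly distributed between col_min and col_max.
    """
    if col_min is None or col_max is None:
        # No min/max statistics: fall back to the default selectivity.
        return default
    lo, hi, x = time_to_nanos(col_min), time_to_nanos(col_max), time_to_nanos(literal)
    if x <= lo:
        return 0.0  # literal is at or below the minimum: no row qualifies
    if x > hi:
        return 1.0  # literal is above the maximum: every row qualifies
    # Here lo < x <= hi, so hi - lo is strictly positive.
    return (x - lo) / (hi - lo)


# Column spans 08:00 to 18:00; the predicate is `col < 10:30`.
selectivity = less_than_selectivity(time(10, 30), time(8), time(18))
estimated_rows = round(selectivity * 1_000)  # for a table of 1,000 rows
```

With statistics present, the estimate depends on where the literal falls in the [min, max] interval. Without them, the sketch returns the fallback, which is the plan-quality gap described in the Problem section.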



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
