[
https://issues.apache.org/jira/browse/SPARK-57784?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18092759#comment-18092759
]
Anupam Yadav commented on SPARK-57784:
--------------------------------------
I am working on this. Plan: extend cost-based optimizer column statistics
estimation to support the TIME data type, mirroring the existing handling for
the other fixed-width temporal types (DATE / TIMESTAMP / TIMESTAMP_NTZ) - i.e.
treat TIME min/max as numeric-convertible for range-based filter selectivity
and histogram/range computations - with unit tests. Will put up a PR shortly.
> Support the TIME data type in cost-based optimizer statistics estimation
> ------------------------------------------------------------------------
>
> Key: SPARK-57784
> URL: https://issues.apache.org/jira/browse/SPARK-57784
> Project: Spark
> Issue Type: Sub-task
> Components: SQL
> Affects Versions: 4.3.0
> Reporter: Max Gekk
> Priority: Major
>
> h2. Background
> Umbrella: SPARK-57550 (Extend support for the TIME data type).
> SPARK-54582 fixes catalog column-statistics (de)serialization for TIME
> ({{CatalogColumnStat.toExternalString}} / {{fromExternalString}}), so
> {{ANALYZE TABLE ...
> COMPUTE STATISTICS FOR COLUMNS}} can persist TIME min/max. This ticket covers
> the
> downstream consumption of those statistics in the cost-based optimizer (CBO).
> h2. Problem
> The CBO estimation code paths (e.g. {{FilterEstimation}}, {{JoinEstimation}},
> {{EstimationUtils}}) special-case ordered datetime types such as {{DateType}}
> and
> {{TimestampType}} when estimating selectivity for range predicates and joins.
> It is not
> verified that {{TimeType}} is handled equivalently. If TIME falls through,
> range
> predicates and joins on TIME columns silently use default selectivity instead
> of
> range-based estimates -- a plan-quality gap (not a correctness bug),
> inconsistent with
> DATE/TIMESTAMP parity.
> h2. Scope
> * Audit CBO estimation utilities for datetime special-casing and ensure
> {{TimeType}}
> (a {{Long}}-backed ordered/atomic type) is treated on par with {{DateType}}
> /
> {{TimestampType}} for:
> ** range/comparison predicate selectivity,
> ** join cardinality on TIME keys,
> ** histogram-based estimation, if applicable.
> * Add estimation unit tests for TIME mirroring the DATE/TIMESTAMP cases.
> h2. Acceptance
> * Range predicates and joins on TIME columns produce range-based row-count
> estimates
> comparable to DATE/TIMESTAMP, backed by tests.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]