[ https://issues.apache.org/jira/browse/SPARK-54582?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Max Gekk updated SPARK-54582:
-----------------------------
Description:
h2. What
Support column statistics collection
({{ANALYZE TABLE ... COMPUTE STATISTICS FOR COLUMNS}}) for {{TIME}} columns,
for all supported precisions (0..6 today, 0..9 after SPARK-57551).
h2. Gap
* In-memory computation (ndv, min, max, null count, histogram) already works
for {{DatetimeType}} via {{CommandUtils}}, and
{{AnalyzeColumnCommand.supportsType}} accepts {{DatetimeType}}.
* The actual gap is *catalog persistence*:
{{CatalogColumnStat.toExternalString}}
({{sql/catalyst/.../catalog/interface.scala}}) has no {{TimeType}} case and
throws {{columnStatisticsSerializationNotSupportedError}} when writing min/max
for TIME columns.
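The persistence gap above amounts to a missing reversible mapping between the in-memory min/max value and a catalog string. The following is a minimal, language-neutral sketch in Python, not Spark's implementation. It assumes the TIME value is held as an integer count of nanoseconds since midnight and that the external form is {{HH:MM:SS.fffffffff}}; both are assumptions made for illustration, and the function names {{time_to_external_string}} and {{time_from_external_string}} are hypothetical.

```python
NANOS_PER_SECOND = 1_000_000_000
NANOS_PER_DAY = 24 * 60 * 60 * NANOS_PER_SECOND


def time_to_external_string(nanos_of_day: int) -> str:
    """Format nanoseconds since midnight as HH:MM:SS.fffffffff.

    Assumption: a fixed 9-digit fraction is written regardless of the
    column's declared precision, so every precision round-trips.
    """
    if not 0 <= nanos_of_day < NANOS_PER_DAY:
        raise ValueError(f"value out of range for a time of day: {nanos_of_day}")
    seconds, frac = divmod(nanos_of_day, NANOS_PER_SECOND)
    minutes, ss = divmod(seconds, 60)
    hh, mm = divmod(minutes, 60)
    return f"{hh:02d}:{mm:02d}:{ss:02d}.{frac:09d}"


def time_from_external_string(text: str) -> int:
    """Parse HH:MM:SS with an optional fraction of up to 9 digits."""
    hms, _, frac = text.partition(".")
    hh, mm, ss = (int(part) for part in hms.split(":"))
    # Right-pad the fraction so "789" is read as 789 milliseconds.
    frac_nanos = int(frac.ljust(9, "0")) if frac else 0
    return ((hh * 60 + mm) * 60 + ss) * NANOS_PER_SECOND + frac_nanos
```

Writing a fixed-width fraction keeps the stored string independent of the column precision, which is one way to make the reload path lossless; the format actually chosen for Spark may differ.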
h2. Scope
* Add the {{TimeType}} branch in {{CatalogColumnStat.toExternalString}} and the
corresponding {{fromExternalString}} parse path, serializing min/max
consistently with other datetime types.
* Verify cost-based-optimizer estimation uses TIME min/max correctly.
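To make the estimation check in the second bullet concrete, here is a rough sketch of how a persisted min/max pair can drive a range-filter selectivity estimate. It assumes a uniform value distribution and TIME values expressed as nanoseconds since midnight; the function name {{range_selectivity_le}} is hypothetical, and Spark's own estimator (including any histogram-based path) may compute this differently.

```python
NANOS_PER_HOUR = 3600 * 1_000_000_000


def range_selectivity_le(literal: int, col_min: int, col_max: int) -> float:
    """Estimate the fraction of non-null rows satisfying `col <= literal`.

    Assumption: values are spread uniformly between col_min and col_max.
    """
    if col_max < col_min:
        raise ValueError("col_max must not be smaller than col_min")
    if literal < col_min:
        return 0.0  # the literal is below every stored value
    if literal >= col_max:
        return 1.0  # every stored value qualifies
    return (literal - col_min) / (col_max - col_min)


# A column spanning 09:00 to 17:00, filtered with time_col <= 11:00.
estimate = range_selectivity_le(
    11 * NANOS_PER_HOUR, 9 * NANOS_PER_HOUR, 17 * NANOS_PER_HOUR
)
```

If min/max were reloaded incorrectly (for example, with a truncated fraction), an estimate like this would shift, which is why the round trip matters for plan quality.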
h2. Acceptance criteria
* {{ANALYZE TABLE t COMPUTE STATISTICS FOR COLUMNS time_col}} persists and
reloads min/max without error, across precisions.
* Tests in the statistics suites (e.g. {{StatisticsCollectionSuite}}).
was:
Add support for collecting column statistics for the TIME data type at all
precision levels (TIME(0) through TIME(6)).
This improves query optimization and performance estimation for tables
containing TIME columns.
> Add Time Type Statistics Collection Support
> -------------------------------------------
>
> Key: SPARK-54582
> URL: https://issues.apache.org/jira/browse/SPARK-54582
> Project: Spark
> Issue Type: Sub-task
> Components: SQL
> Affects Versions: 4.1.0
> Reporter: Vinod KC
> Priority: Major
> Labels: pull-request-available