[ https://issues.apache.org/jira/browse/SPARK-54582?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Max Gekk updated SPARK-54582:
-----------------------------
    Description: 
h2. What

Support column statistics collection ({{ANALYZE TABLE ... COMPUTE STATISTICS FOR COLUMNS}}) for {{TIME}} columns, for all supported precisions (0..6 today, 0..9 after SPARK-57551).

h2. Gap (be specific)

* In-memory computation (ndv, min, max, null count, histogram) already works for {{DatetimeType}} via {{CommandUtils}}, and {{AnalyzeColumnCommand.supportsType}} accepts {{DatetimeType}}.
* The actual gap is *catalog persistence*: {{CatalogColumnStat.toExternalString}} ({{sql/catalyst/.../catalog/interface.scala}}) has no {{TimeType}} case and throws {{columnStatisticsSerializationNotSupportedError}} when writing min/max for TIME columns.

h2. Scope

* Add the {{TimeType}} branch in {{CatalogColumnStat.toExternalString}} and the corresponding {{fromExternalString}} parse path, serializing min/max consistently with other datetime types.
* Verify cost-based-optimizer estimation uses TIME min/max correctly.
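
To make the second scope item concrete, here is a minimal sketch of how persisted min/max could feed a range-selectivity estimate. It is a generic uniform-distribution estimate written in Python for illustration, not Spark's actual estimation code. The nanoseconds-since-midnight encoding and the names {{time_to_nanos}} and {{less_than_selectivity}} are assumptions introduced here.

```python
NANOS_PER_SECOND = 1_000_000_000


def time_to_nanos(hour: int, minute: int, second: int = 0, nanos: int = 0) -> int:
    """Encode a time of day as nanoseconds since midnight (assumed encoding)."""
    return ((hour * 60 + minute) * 60 + second) * NANOS_PER_SECOND + nanos


def less_than_selectivity(literal: int, col_min: int, col_max: int) -> float:
    """Estimate the fraction of non-null rows satisfying `col < literal`.

    Hypothetical helper: assumes values are spread uniformly between
    col_min and col_max, both taken from the column statistics.
    """
    if literal <= col_min:
        return 0.0  # nothing can be below the recorded minimum
    if literal > col_max:
        return 1.0  # every value is below the literal
    # Here col_min < literal <= col_max, so the denominator is positive.
    return (literal - col_min) / (col_max - col_min)


# Column spans 09:00..17:00; the predicate is time_col < TIME'11:00:00'.
estimate = less_than_selectivity(
    time_to_nanos(11, 0), time_to_nanos(9, 0), time_to_nanos(17, 0)
)
print(estimate)  # 0.25
```

If min/max were missing or deserialized incorrectly, an estimate of this kind would fall back to a default or be skewed, which is why the verification step matters.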

h2. Acceptance criteria

* {{ANALYZE TABLE t COMPUTE STATISTICS FOR COLUMNS time_col}} persists and reloads min/max without error, across precisions.
* Tests in the statistics suites (e.g. {{StatisticsCollectionSuite}}).
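
The round trip required by the first criterion can be sketched outside Spark as follows. This is an illustrative Python model under stated assumptions, not the {{CatalogColumnStat}} implementation: it assumes TIME values are held as nanoseconds since midnight and picks an ISO-style {{HH:MM:SS[.ffffff]}} string as the external form. The actual format is whatever the patch chooses to stay consistent with other datetime types. All function names are hypothetical, and only precisions 0..6 are covered because Python's {{datetime.time}} has microsecond resolution.

```python
from datetime import time

NANOS_PER_MICRO = 1_000
MICROS_PER_SECOND = 1_000_000


def truncate_to_precision(nanos: int, precision: int) -> int:
    """Drop fractional-second digits beyond `precision` (0..6)."""
    if not 0 <= precision <= 6:
        raise ValueError(f"unsupported precision: {precision}")
    unit = 10 ** (9 - precision)  # nanoseconds per smallest kept digit
    return nanos - nanos % unit


def to_external_string(nanos: int) -> str:
    """Serialize nanoseconds since midnight to an ISO-style string (assumed format)."""
    seconds, micro = divmod(nanos // NANOS_PER_MICRO, MICROS_PER_SECOND)
    minutes, second = divmod(seconds, 60)
    hour, minute = divmod(minutes, 60)
    return time(hour, minute, second, micro).isoformat()


def from_external_string(text: str) -> int:
    """Parse the string produced by to_external_string back to nanoseconds."""
    t = time.fromisoformat(text)
    seconds = (t.hour * 60 + t.minute) * 60 + t.second
    return (seconds * MICROS_PER_SECOND + t.microsecond) * NANOS_PER_MICRO


def check_round_trip(nanos: int) -> None:
    """Assert that serialization is lossless at every precision 0..6."""
    for precision in range(7):
        value = truncate_to_precision(nanos, precision)
        assert from_external_string(to_external_string(value)) == value


# 12:34:56.123456 expressed as nanoseconds since midnight.
sample = 45_296_123_456_000
check_round_trip(sample)
print(to_external_string(truncate_to_precision(sample, 3)))  # 12:34:56.123000
```

A suite test would run the equivalent check through {{ANALYZE TABLE}} and a catalog reload rather than through standalone helpers like these.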

  was:
 Add support for collecting column statistics for the TIME data type at all precision levels (TIME(0) through TIME(6)).

This helps improve query optimization and performance estimation for tables containing TIME columns.


> Add Time Type Statistics Collection Support
> -------------------------------------------
>
>                 Key: SPARK-54582
>                 URL: https://issues.apache.org/jira/browse/SPARK-54582
>             Project: Spark
>          Issue Type: Sub-task
>          Components: SQL
>    Affects Versions: 4.1.0
>            Reporter: Vinod KC
>            Priority: Major
>              Labels: pull-request-available


