[ 
https://issues.apache.org/jira/browse/SPARK-57550?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Max Gekk updated SPARK-57550:
-----------------------------
    Description: 
Umbrella for follow-up work extending support for the TIME data type (TimeType) 
introduced by the SPIP SPARK-51162. Collects TIME-related tasks such as casts 
to/from other types, columnar/Arrow support, Parquet/Variant interop, and 
statistics collection.

h2. Prioritization

Suggested ordering of the open sub-tasks by dependency, correctness impact, 
ANSI/user value, and existing momentum (open PRs).

h3. Dependencies / blockers

* SPARK-57551 (TIME precision -> 9) gates SPARK-57552 and SPARK-57554.
* SPARK-57552 and SPARK-57554 additionally depend on the nanosecond TIMESTAMP 
types (TimestampNTZNanosType / TimestampLTZNanos), tracked outside this 
umbrella.
* SPARK-54203 (RowToColumnConverter) gates the columnar follow-ons SPARK-57567, 
SPARK-57568, SPARK-57569, SPARK-57570 (each "is blocked by" it).
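
The in-umbrella blockers above form a small dependency graph, and any valid landing order is a topological order of it. A minimal sketch using Python's standard graphlib follows; the dictionary layout and the helper name landing_order are illustrative assumptions, and the external nanosecond TIMESTAMP dependency is left out because it is tracked outside this umbrella.

```python
from graphlib import TopologicalSorter

# Each ticket maps to the tickets that must land before it ("is blocked by").
# Only the in-umbrella edges listed in this section are modelled.
blocked_by = {
    "SPARK-57552": {"SPARK-57551"},
    "SPARK-57554": {"SPARK-57551"},
    "SPARK-57567": {"SPARK-54203"},
    "SPARK-57568": {"SPARK-54203"},
    "SPARK-57569": {"SPARK-54203"},
    "SPARK-57570": {"SPARK-54203"},
}


def landing_order(graph):
    """Return one valid order in which the tickets can land (hypothetical helper)."""
    return list(TopologicalSorter(graph).static_order())


order = landing_order(blocked_by)
print(order)
```

Several orders are valid; the only guarantee is that each blocker appears before the tickets it gates.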

h3. Tier 1 - do first (correctness gaps; small, no dependencies)

TIME currently throws or fails in these paths, so the gaps behave like bugs:
* SPARK-54203 - RowToColumnConverter: TIME hits unsupportedDataTypeError in 
row->column conversion (caching/vectorized paths). Best single first ticket.
* SPARK-54582 - stats serialization: CatalogColumnStat.toExternalString throws 
for TIME, so ANALYZE TABLE min/max persistence is broken.
* SPARK-57559 - add a TimeType case to PhysicalDataType: trivial robustness fix.

h3. Tier 2 - high ANSI/user value, no deps, momentum (PRs exist)

* SPARK-52617 - TIME <-> TIMESTAMP_NTZ (micros): ANSI-mandatory cast, highest 
everyday value (PR open).
* SPARK-54281 - numeric -> TIME: completes cast symmetry (PR open).
* SPARK-57553 - TIME <-> TIMESTAMP_LTZ (micros): finishes the ANSI cast matrix 
for the common timestamp type.
* SPARK-52621 - TIME <-> VARIANT (PR open); needs the encoding decision first.
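
As an illustration of what the TIME <-> TIMESTAMP_NTZ cast is expected to do, here is a plain-Python sketch of the usual SQL semantics: casting a zone-less timestamp to TIME keeps only the time-of-day fields, and casting TIME to a timestamp fills in the date from the current date. This models the semantics only; it is not Spark code, the function names are hypothetical, and the behavior of the open PR may differ in details.

```python
from datetime import date, datetime, time


def timestamp_ntz_to_time(ts: datetime) -> time:
    """Drop the date fields and keep the time of day (microsecond resolution)."""
    return ts.time()


def time_to_timestamp_ntz(t: time, current_date: date) -> datetime:
    """Combine a time of day with a date supplied by the caller.

    Assumption: the date part comes from the session's current date,
    which is passed in explicitly so the result is deterministic.
    """
    return datetime.combine(current_date, t)


ts = datetime(2025, 1, 2, 3, 4, 5, 123456)
t = timestamp_ntz_to_time(ts)
round_trip = time_to_timestamp_ntz(t, date(2025, 1, 2))
print(t, round_trip)
```

The round trip is lossless only when the date passed back in matches the original date; the TIME value itself carries no date information.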

h3. Tier 3 - foundational enabler for the nanosecond line

* SPARK-57551 - precision -> 9: highest-leverage enabler; unblocks SPARK-57552 
/ SPARK-57554 and aligns TIME with the in-flight nanos TIMESTAMP work and 
ANSI's "TIME and TIMESTAMP share the same max precision" rule. Start early if 
the nanosecond direction is a release priority.
* Then SPARK-57552 and SPARK-57554, once SPARK-57551 and the nanos TIMESTAMP 
types are in.
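
To make "precision -> 9" concrete: precision p is the number of fractional-second digits a TIME(p) value keeps, so p = 6 is microseconds and p = 9 is nanoseconds. The sketch below assumes a value held as nanoseconds since midnight and assumes truncation (not rounding) when precision is reduced; both are assumptions made for illustration, and the helper name is hypothetical.

```python
NANOS_PER_DAY = 24 * 60 * 60 * 1_000_000_000


def truncate_to_precision(nanos_of_day: int, precision: int) -> int:
    """Zero out fractional-second digits beyond `precision` (0..9)."""
    if not 0 <= precision <= 9:
        raise ValueError(f"precision must be in [0, 9], got {precision}")
    if not 0 <= nanos_of_day < NANOS_PER_DAY:
        raise ValueError("value is outside the 24-hour range")
    # Size of one unit at this precision, in nanoseconds
    # (precision 9 -> 1 ns, precision 6 -> 1000 ns, precision 0 -> 1 s).
    step = 10 ** (9 - precision)
    return nanos_of_day - nanos_of_day % step


value = 123_456_789  # 0.123456789 seconds after midnight
print(truncate_to_precision(value, 9), truncate_to_precision(value, 6))
```

With a maximum precision of 6 the last three digits cannot be represented, which is why the nanosecond casts wait on the precision change.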

h3. Tier 4 - valuable but independent / can run anytime

* SPARK-57555 - JDBC data source: biggest migration payoff (the SPIP 
motivation), but a larger multi-dialect effort; parallelize on its own track.
* SPARK-54507 - time_bucket (PR open).
* SPARK-57558 - LOCALTIME (small, ANSI).
* SPARK-57557 - quantile/sketch aggregates.
* SPARK-57562 - TIME benchmarks: addresses the SPIP Q6 performance-regression 
risk; worth running alongside the feature work rather than deferring.

h3. Tier 5 - lower priority / niche / polish

* SPARK-53368 - Parquet isAdjustedToUTC=true (PR open, minor).
* SPARK-57560 - TRY-mode arithmetic.
* SPARK-57556 - Hive interop (Hive has no TIME; mostly a documented-limitation 
task).
* SPARK-51403 - ordered/atomic tests (starter).
* SPARK-57030 / SPARK-57031 - docs (do last, once behavior is settled).

h3. Testing and benchmarks (cross-cutting)

Coverage to reach parity with DATE/TIMESTAMP. Feature tickets carry their own 
unit tests; these track the remaining gaps:
* SPARK-57562 - TIME benchmarks (also in Tier 4; prioritize per SPIP Q6).
* SPARK-57561 - verify datetime functions reject TIME with a clear error (no 
silent coercion).
* SPARK-57563 - SQL golden-file parity (try_cast, datetime-parsing/formatting, 
postgreSQL/time.sql, ...).
* SPARK-57564 - catalyst/core unit-test parity (ExpressionEncoderSuite, 
DDLParserSuite, DataTypeWriteCompatibilitySuite).
* SPARK-57565 - PySpark TIME test coverage.
* SPARK-57566 - Spark Connect TIME test coverage.

h3. Columnar follow-ons (blocked by SPARK-54203)

These become possible only once SPARK-54203 lands; each may also need TIME 
support in its own layer (Arrow type mapping, Parquet/ORC logical types, 
Variant encoding). Ordered by value:
* SPARK-57567 - TIME in Arrow-based Python/pandas UDF evaluation (user-facing 
PySpark value).
* SPARK-57570 - TIME in vectorized-reader column population 
(partition/missing/constant columns; read correctness).
* SPARK-57568 - TIME in Parquet/ORC aggregate push-down (performance; niche).
* SPARK-57569 - TIME in Parquet Variant shredding (niche; relates to 
SPARK-52621).

h3. Bottom line

* Implement first: SPARK-54203 (smallest, no deps, closes a real failure path).
* In parallel, kick off: SPARK-57551 (foundational blocker for the nanosecond 
cast branch).
* Then drive to done: the ANSI cast tickets with existing PRs (SPARK-52617, 
SPARK-54281).

  was:
Umbrella for follow-up work extending support for the TIME data type (TimeType) 
introduced by the SPIP SPARK-51162. Collects TIME-related tasks such as casts 
to/from other types, columnar/Arrow support, Parquet/Variant interop, and 
statistics collection.

h2. Prioritization

Suggested ordering of the open sub-tasks by dependency, correctness impact, 
ANSI/user value, and existing momentum (open PRs).

h3. Dependencies / blockers

* SPARK-57551 (TIME precision -> 9) is the only hard in-umbrella blocker: it 
gates SPARK-57552 and SPARK-57554.
* SPARK-57552 and SPARK-57554 additionally depend on the nanosecond TIMESTAMP 
types (TimestampNTZNanosType / TimestampLTZNanos), tracked outside this 
umbrella.

h3. Tier 1 - do first (correctness gaps; small, no dependencies)

TIME currently throws/fails in these paths, so they behave like bugs:
* SPARK-54203 - RowToColumnConverter: TIME hits unsupportedDataTypeError in 
row->column conversion (caching/vectorized paths). Best single first ticket.
* SPARK-54582 - stats serialization: CatalogColumnStat.toExternalString throws 
for TIME, so ANALYZE TABLE min/max persistence is broken.
* SPARK-57559 - add a TimeType case to PhysicalDataType: trivial robustness fix.

h3. Tier 2 - high ANSI/user value, no deps, momentum (PRs exist)

* SPARK-52617 - TIME <-> TIMESTAMP_NTZ (micros): ANSI-mandatory cast, highest 
everyday value (PR open).
* SPARK-54281 - numeric -> TIME: completes cast symmetry (PR open).
* SPARK-57553 - TIME <-> TIMESTAMP_LTZ (micros): finishes the ANSI cast matrix 
for the common timestamp type.
* SPARK-52621 - TIME <-> VARIANT (PR open); needs the encoding decision first.

h3. Tier 3 - foundational enabler for the nanosecond line

* SPARK-57551 - precision -> 9: highest-leverage enabler; unblocks SPARK-57552 
/ SPARK-57554 and aligns TIME with the in-flight nanos TIMESTAMP work and 
ANSI's "TIME and TIMESTAMP share the same max precision" rule. Start early if 
the nanosecond direction is a release priority.
* Then SPARK-57552 and SPARK-57554 once 57551 and the nanos TIMESTAMP types are 
in.

h3. Tier 4 - valuable but independent / can run anytime

* SPARK-57555 - JDBC data source: biggest migration payoff (the SPIP 
motivation), but a larger multi-dialect effort; parallelize on its own track.
* SPARK-54507 - time_bucket (PR open), SPARK-57558 - LOCALTIME (small, ANSI), 
SPARK-57557 - quantile/sketch aggregates.
* SPARK-57562 - TIME benchmarks: addresses the SPIP Q6 performance-regression 
risk; worth running alongside the feature work rather than deferring.

h3. Tier 5 - lower priority / niche / polish

* SPARK-53368 - Parquet isAdjustedToUTC=true (PR open, minor), SPARK-57560 - 
TRY-mode arithmetic, SPARK-57556 - Hive interop (Hive has no TIME; mostly a 
documented-limitation task), SPARK-51403 - ordered/atomic tests (starter), and 
docs SPARK-57030 / SPARK-57031 (do last, once behavior is settled).

h3. Testing and benchmarks (cross-cutting)

Coverage to reach parity with DATE/TIMESTAMP. Feature tickets carry their own 
unit tests; these track the remaining gaps:
* SPARK-57562 - TIME benchmarks (also in Tier 4; prioritize per SPIP Q6).
* SPARK-57561 - verify datetime functions reject TIME with a clear error (no 
silent coercion).
* SPARK-57563 - SQL golden-file parity (try_cast, datetime-parsing/formatting, 
postgreSQL/time.sql, ...).
* SPARK-57564 - catalyst/core unit-test parity (ExpressionEncoderSuite, 
DDLParserSuite, DataTypeWriteCompatibilitySuite).
* SPARK-57565 - PySpark TIME test coverage.
* SPARK-57566 - Spark Connect TIME test coverage.

h3. Bottom line

* Implement first: SPARK-54203 (smallest, no deps, closes a real failure path).
* In parallel, kick off: SPARK-57551 (foundational blocker for the nanosecond 
cast branch).
* Then drive to done: the ANSI cast tickets with existing PRs (SPARK-52617, 
SPARK-54281).


> Extend support for the TIME data type
> -------------------------------------
>
>                 Key: SPARK-57550
>                 URL: https://issues.apache.org/jira/browse/SPARK-57550
>             Project: Spark
>          Issue Type: Umbrella
>          Components: SQL
>    Affects Versions: 4.3.0
>            Reporter: Max Gekk
>            Assignee: Max Gekk
>            Priority: Major



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
