[
https://issues.apache.org/jira/browse/SPARK-57550?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Max Gekk updated SPARK-57550:
-----------------------------
Description:
Umbrella for follow-up work extending support for the TIME data type (TimeType)
introduced by the SPIP SPARK-51162. Collects TIME-related tasks such as casts
to/from other types, columnar/Arrow support, Parquet/Variant interop, and
statistics collection.
h2. Prioritization
Suggested ordering of the open sub-tasks by dependency, correctness impact,
ANSI/user value, and existing momentum (open PRs).
h3. Dependencies / blockers
* SPARK-57551 (TIME precision -> 9) is the only hard in-umbrella blocker: it
gates SPARK-57552 and SPARK-57554.
* SPARK-57552 and SPARK-57554 additionally depend on the nanosecond TIMESTAMP
types (TimestampNTZNanosType / TimestampLTZNanos), tracked outside this
umbrella.
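The blocker relationships above can be sketched as a small dependency graph. This is only an illustration: the label "nanos-TIMESTAMP-types" is a placeholder for the externally tracked work, not a real ticket key, and the resulting order is one valid ordering among several.

```python
from graphlib import TopologicalSorter

# Map each ticket to the set of items that must land before it.
# "nanos-TIMESTAMP-types" is a placeholder label (assumption), standing in
# for the nanosecond TIMESTAMP work tracked outside this umbrella.
blocked_by = {
    "SPARK-57552": {"SPARK-57551", "nanos-TIMESTAMP-types"},
    "SPARK-57554": {"SPARK-57551", "nanos-TIMESTAMP-types"},
}

# static_order() yields every item after all of its prerequisites.
order = list(TopologicalSorter(blocked_by).static_order())
print(order)
```

Any ordering it prints places SPARK-57551 and the external nanosecond types ahead of SPARK-57552 and SPARK-57554.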
h3. Tier 1 - do first (correctness gaps; small, no dependencies)
TIME currently throws/fails in these paths, so they behave like bugs:
* SPARK-54203 - RowToColumnConverter: TIME hits unsupportedDataTypeError in
row->column conversion (caching/vectorized paths). Best single first ticket.
* SPARK-54582 - stats serialization: CatalogColumnStat.toExternalString throws
for TIME, so ANALYZE TABLE min/max persistence is broken.
* SPARK-57559 - add a TimeType case to PhysicalDataType: trivial robustness fix.
h3. Tier 2 - high ANSI/user value, no deps, momentum (PRs exist)
* SPARK-52617 - TIME <-> TIMESTAMP_NTZ (micros): ANSI-mandatory cast, highest
everyday value (PR open).
* SPARK-54281 - numeric -> TIME: completes cast symmetry (PR open).
* SPARK-57553 - TIME <-> TIMESTAMP_LTZ (micros): finishes the ANSI cast matrix
for the common timestamp type.
* SPARK-52621 - TIME <-> VARIANT (PR open); needs the encoding decision first.
h3. Tier 3 - foundational enabler for the nanosecond line
* SPARK-57551 - precision -> 9: highest-leverage enabler; unblocks SPARK-57552
/ SPARK-57554 and aligns TIME with the in-flight nanos TIMESTAMP work and
ANSI's "TIME and TIMESTAMP share the same max precision" rule. Start early if
the nanosecond direction is a release priority.
* Then SPARK-57552 and SPARK-57554, once SPARK-57551 and the nanosecond
TIMESTAMP types have landed.
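To make "precision -> 9" concrete, here is a minimal sketch of what fractional-second precision means for a time-of-day value. It assumes, purely for illustration, that a TIME value is modeled as an integer count of nanoseconds since midnight; the helper name truncate_to_precision is hypothetical and is not a Spark API.

```python
NANOS_PER_DAY = 24 * 60 * 60 * 10**9  # nanoseconds in one day


def truncate_to_precision(nanos_of_day: int, precision: int) -> int:
    """Drop fractional-second digits beyond `precision` (0 to 9)."""
    if not 0 <= precision <= 9:
        raise ValueError("precision must be between 0 and 9")
    if not 0 <= nanos_of_day < NANOS_PER_DAY:
        raise ValueError("value is outside a single day")
    step = 10 ** (9 - precision)  # size of one unit at this precision, in ns
    return nanos_of_day - nanos_of_day % step


# 12:00:00.123456789 expressed as nanoseconds since midnight (illustrative).
value = 12 * 3600 * 10**9 + 123_456_789

micros_view = truncate_to_precision(value, 6)  # keeps 12:00:00.123456
nanos_view = truncate_to_precision(value, 9)   # keeps all nine digits
```

Under this model, precision 6 discards the last three digits while precision 9 is lossless, and a full day of nanoseconds still fits comfortably in a signed 64-bit integer.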
h3. Tier 4 - valuable but independent / can run anytime
* SPARK-57555 - JDBC data source: biggest migration payoff (the SPIP
motivation), but a larger multi-dialect effort; parallelize on its own track.
* SPARK-54507 - time_bucket (PR open), SPARK-57558 - LOCALTIME (small, ANSI),
SPARK-57557 - quantile/sketch aggregates.
* SPARK-57562 - TIME benchmarks: addresses the SPIP Q6 performance-regression
risk; worth running alongside the feature work rather than deferring.
h3. Tier 5 - lower priority / niche / polish
* SPARK-53368 - Parquet isAdjustedToUTC=true (PR open, minor), SPARK-57560 -
TRY-mode arithmetic, SPARK-57556 - Hive interop (Hive has no TIME; mostly a
documented-limitation task), SPARK-51403 - ordered/atomic tests (starter), and
docs SPARK-57030 / SPARK-57031 (do last, once behavior is settled).
h3. Testing and benchmarks (cross-cutting)
Coverage to reach parity with DATE/TIMESTAMP. Feature tickets carry their own
unit tests; these track the remaining gaps:
* SPARK-57562 - TIME benchmarks (also in Tier 4; prioritize per SPIP Q6).
* SPARK-57561 - verify datetime functions reject TIME with a clear error (no
silent coercion).
* SPARK-57563 - SQL golden-file parity (try_cast, datetime-parsing/formatting,
postgreSQL/time.sql, ...).
* SPARK-57564 - catalyst/core unit-test parity (ExpressionEncoderSuite,
DDLParserSuite, DataTypeWriteCompatibilitySuite).
* SPARK-57565 - PySpark TIME test coverage.
* SPARK-57566 - Spark Connect TIME test coverage.
h3. Bottom line
* Implement first: SPARK-54203 (smallest, no deps, closes a real failure path).
* In parallel, kick off: SPARK-57551 (foundational blocker for the nanosecond
cast branch).
* Then drive to done: the ANSI cast tickets with existing PRs (SPARK-52617,
SPARK-54281).
was:
Umbrella for follow-up work extending support for the TIME data type (TimeType)
introduced by the SPIP SPARK-51162. Collects TIME-related tasks such as casts
to/from other types, columnar/Arrow support, Parquet/Variant interop, and
statistics collection.
h2. Prioritization
Suggested ordering of the open sub-tasks by dependency, correctness impact,
ANSI/user value, and existing momentum (open PRs).
h3. Dependencies / blockers
* SPARK-57551 (TIME precision -> 9) is the only hard in-umbrella blocker: it
gates SPARK-57552 and SPARK-57554.
* SPARK-57552 and SPARK-57554 additionally depend on the nanosecond TIMESTAMP
types (TimestampNTZNanosType / TimestampLTZNanos), tracked outside this
umbrella.
h3. Tier 1 - do first (correctness gaps; small, no dependencies)
TIME currently throws/fails in these paths, so they behave like bugs:
* SPARK-54203 - RowToColumnConverter: TIME hits unsupportedDataTypeError in
row->column conversion (caching/vectorized paths). Best single first ticket.
* SPARK-54582 - stats serialization: CatalogColumnStat.toExternalString throws
for TIME, so ANALYZE TABLE min/max persistence is broken.
* SPARK-57559 - add a TimeType case to PhysicalDataType: trivial robustness fix.
h3. Tier 2 - high ANSI/user value, no deps, momentum (PRs exist)
* SPARK-52617 - TIME <-> TIMESTAMP_NTZ (micros): ANSI-mandatory cast, highest
everyday value (PR open).
* SPARK-54281 - numeric -> TIME: completes cast symmetry (PR open).
* SPARK-57553 - TIME <-> TIMESTAMP_LTZ (micros): finishes the ANSI cast matrix
for the common timestamp type.
* SPARK-52621 - TIME <-> VARIANT (PR open); needs the encoding decision first.
h3. Tier 3 - foundational enabler for the nanosecond line
* SPARK-57551 - precision -> 9: highest-leverage enabler; unblocks SPARK-57552
/ SPARK-57554 and aligns TIME with the in-flight nanos TIMESTAMP work and
ANSI's "TIME and TIMESTAMP share the same max precision" rule. Start early if
the nanosecond direction is a release priority.
* Then SPARK-57552 and SPARK-57554 once 57551 and the nanos TIMESTAMP types are
in.
h3. Tier 4 - valuable but independent / can run anytime
* SPARK-57555 - JDBC data source: biggest migration payoff (the SPIP
motivation), but a larger multi-dialect effort; parallelize on its own track.
* SPARK-54507 - time_bucket (PR open), SPARK-57558 - LOCALTIME (small, ANSI),
SPARK-57557 - quantile/sketch aggregates.
h3. Tier 5 - lower priority / niche / polish
* SPARK-53368 - Parquet isAdjustedToUTC=true (PR open, minor), SPARK-57560 -
TRY-mode arithmetic, SPARK-57556 - Hive interop (Hive has no TIME; mostly a
documented-limitation task), SPARK-51403 - ordered/atomic tests (starter), and
docs SPARK-57030 / SPARK-57031 (do last, once behavior is settled).
h3. Bottom line
* Implement first: SPARK-54203 (smallest, no deps, closes a real failure path).
* In parallel, kick off: SPARK-57551 (foundational blocker for the nanosecond
cast branch).
* Then drive to done: the ANSI cast tickets with existing PRs (SPARK-52617,
SPARK-54281).
> Extend support for the TIME data type
> -------------------------------------
>
> Key: SPARK-57550
> URL: https://issues.apache.org/jira/browse/SPARK-57550
> Project: Spark
> Issue Type: Umbrella
> Components: SQL
> Affects Versions: 4.3.0
> Reporter: Max Gekk
> Assignee: Max Gekk
> Priority: Major