[
https://issues.apache.org/jira/browse/SPARK-57551?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Max Gekk updated SPARK-57551:
-----------------------------
Description:
h2. What
Extend the fractional-seconds precision of the {{TIME}} data type from the current
maximum of 6 (microseconds) to 9 (nanoseconds). After this change {{TIME(p)}} accepts
{{0 <= p <= 9}}.
h2. Why
* Internal storage is *already* nanoseconds-since-midnight ({{Long}}), introduced by
SPARK-52460. {{TimeType.NANOS_PRECISION = 9}} is already defined; only the cap
{{TimeType.MAX_PRECISION = 6}} prevents using it.
* ANSI SQL (ISO/IEC 9075-2, 6.1 <data type>) makes the maximum {{<time precision>}}
implementation-defined with the sole constraint that it is *not less than 6*, and
Syntax Rule 36 requires the maximum of {{<time precision>}} and {{<timestamp precision>}}
to be *the same* implementation-defined value.
* This worktree already supports nanosecond timestamps via {{TimestampNTZNanosType}} /
{{TimestampLTZNanos}} (precision 7..9). To stay ANSI-consistent, {{TIME}} must reach
precision 9 in lockstep.
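The storage claim in the first bullet can be checked with simple arithmetic. The following is a minimal standalone Python sketch, not Spark code; all constant and function names are illustrative assumptions. It shows that the largest nanoseconds-since-midnight value fits in a signed 64-bit integer, so raising the precision cap should not need a wider storage type.

```python
# Illustrative constants (hypothetical names, not taken from Spark).
NANOS_PER_SECOND = 1_000_000_000
SECONDS_PER_DAY = 24 * 60 * 60
NANOS_PER_DAY = SECONDS_PER_DAY * NANOS_PER_SECOND

# Bounds of a signed 64-bit integer (a JVM Long).
LONG_MIN = -(2**63)
LONG_MAX = 2**63 - 1

# Largest valid TIME value at precision 9: 23:59:59.999999999.
MAX_TIME_NANOS = NANOS_PER_DAY - 1


def fits_in_long(value: int) -> bool:
    """Return True if value is representable as a signed 64-bit integer."""
    return LONG_MIN <= value <= LONG_MAX


if __name__ == "__main__":
    print(NANOS_PER_DAY, fits_in_long(MAX_TIME_NANOS))
```

A day holds 86,400,000,000,000 nanoseconds, roughly five orders of magnitude below the 64-bit limit.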
h2. Scope
* Lift {{TimeType.MAX_PRECISION}} from 6 to 9 and update precision validation in
{{TimeType}} and {{DataTypeAstBuilder}}.
* Update {{SparkDateTimeUtils.truncateTimeToPrecision}} (and its {{<= MAX_PRECISION}}
assertion) to support p in 7..9.
* Time formatters/parsers ({{TimeFormatter}}, {{FractionTimeFormatter}}) must format and
parse 7..9 fractional digits.
* Parquet I/O: the writer currently emits the {{TIME(MICROS)}} logical type; emit
{{TIME(NANOS)}} for p in 7..9 and read it back ({{TimeTypeParquetOps}},
{{ParquetSchemaConverter}}, {{ParquetWriteSupport}}, {{ParquetRowConverter}},
vectorized reader).
* Verify casts already implemented for TIME (TIME(p1)->TIME(p2), TIME->DECIMAL,
TIME->integral, STRING<->TIME) behave correctly for p in 7..9.
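To make the truncation and formatting items above concrete, here is a minimal Python sketch under stated assumptions. It is not the Spark implementation: the function names are hypothetical, and the choice to print exactly p fractional digits (rather than, say, trimming trailing zeros) is an assumption made only for illustration.

```python
NANOS_PER_SECOND = 1_000_000_000
NANOS_PER_DAY = 24 * 60 * 60 * NANOS_PER_SECOND
MAX_PRECISION = 9  # proposed cap from this ticket


def truncate_time_to_precision(nanos: int, p: int) -> int:
    """Drop fractional digits beyond precision p from nanoseconds-since-midnight."""
    if not 0 <= p <= MAX_PRECISION:
        raise ValueError(f"precision {p} is outside [0, {MAX_PRECISION}]")
    if not 0 <= nanos < NANOS_PER_DAY:
        raise ValueError(f"{nanos} is not a valid nanoseconds-since-midnight value")
    unit = 10 ** (MAX_PRECISION - p)  # p=9 -> 1 ns, p=6 -> 1000 ns, p=0 -> 1 s
    return nanos - nanos % unit


def format_time(nanos: int, p: int) -> str:
    """Render HH:mm:ss with exactly p fractional digits (assumed display rule)."""
    truncated = truncate_time_to_precision(nanos, p)
    seconds, frac = divmod(truncated, NANOS_PER_SECOND)
    hh, rem = divmod(seconds, 3600)
    mm, ss = divmod(rem, 60)
    text = f"{hh:02d}:{mm:02d}:{ss:02d}"
    if p > 0:
        text += "." + f"{frac:09d}"[:p]
    return text


if __name__ == "__main__":
    value = (12 * 3600 + 34 * 60 + 56) * NANOS_PER_SECOND + 123_456_789
    for p in (6, 7, 8, 9):
        print(p, format_time(value, p))
```

The single `10 ** (MAX_PRECISION - p)` divisor covers the existing 0..6 range and the new 7..9 range alike, which is why lifting the cap mainly affects validation rather than the arithmetic.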
h2. Out of scope
* Casts to/from TIMESTAMP types (tracked separately).
* TIME WITH TIME ZONE (non-goal per SPARK-51162).
h2. Acceptance criteria
* {{TIME(7)}}, {{TIME(8)}}, {{TIME(9)}} can be declared, parsed, and used as literals.
* Round-trip through Parquet preserves nanosecond values.
* Existing TIME tests pass; new tests cover the 7..9 range.
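The Parquet round-trip criterion can be illustrated with a small Python sketch. It models only the unit selection described in the Scope section (MICROS for p up to 6, NANOS for p in 7..9) on plain integers; it does not call any Parquet library, and the function names are hypothetical.

```python
MICROS_PRECISION = 6
MAX_PRECISION = 9
NANOS_PER_MICRO = 1_000


def parquet_time_unit(p: int) -> str:
    """Pick the Parquet TIME unit for a given TIME(p), per the proposed scope."""
    if not 0 <= p <= MAX_PRECISION:
        raise ValueError(f"precision {p} is outside [0, {MAX_PRECISION}]")
    return "NANOS" if p > MICROS_PRECISION else "MICROS"


def encode(nanos: int, p: int) -> tuple[int, str]:
    """Convert nanoseconds-since-midnight into the stored integer and its unit."""
    unit = parquet_time_unit(p)
    stored = nanos if unit == "NANOS" else nanos // NANOS_PER_MICRO
    return stored, unit


def decode(stored: int, unit: str) -> int:
    """Convert the stored integer back to nanoseconds-since-midnight."""
    return stored if unit == "NANOS" else stored * NANOS_PER_MICRO


if __name__ == "__main__":
    nanos = 45_296_123_456_789  # 12:34:56.123456789
    print(decode(*encode(nanos, 9)) == nanos)  # NANOS keeps every digit
    print(decode(*encode(nanos, 6)) == nanos)  # MICROS drops the last three
```

Writing a value with sub-microsecond digits through the MICROS unit loses them, which is why the 7..9 range needs the NANOS unit for the round trip to be lossless.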
h2. Test impact (max precision 6 -> 9)
Tests that hard-code precision 6 will need updating. The breaking ones (assert >6 is invalid)
should be fixed in this ticket; broader 7-9 coverage is tracked by SPARK-57563.
h3. MUST-UPDATE (assert precision > 6 is invalid; will fail/flip)
* DataTypeParserSuite.scala (test "unsupported precision of the time data type"): time(8)/time(9)
currently expect UNSUPPORTED_TIME_PRECISION -> become valid; move the invalid case to time(10)
and add valid 7/8/9 parse cases.
* DataTypeSuite.scala (test "Parse time(n) as TimeType(n)"): extend the {{0 to 6}} loop to 0..9;
{{DataType.fromJson("time(9)")}} expects INVALID_JSON_DATA_TYPE -> must parse; move invalid JSON
to time(10). (The {{MAX_PRECISION + 1}} invalid-range check auto-adjusts.)
* TimeExpressionsSuite.scala (CurrentTime range check, ~lines 318-327): expected
valueRange "[0, 6]" (from MICROS_PRECISION) -> "[0, 9]"; also switch the production current_time
precision check from MICROS_PRECISION to MAX_PRECISION and add valid current_time(7/8/9).
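The shape of the flipped assertions can be sketched in Python. This is a toy parser written for illustration, not the Spark parser or its test suites: {{parse_time_type}} is a hypothetical helper, and only the error name UNSUPPORTED_TIME_PRECISION and the default precision of 6 come from this ticket.

```python
import re

DEFAULT_PRECISION = 6  # assumed to stay 6, per the note in this ticket
MAX_PRECISION = 9      # proposed cap

# Matches "time" or "time(n)", ignoring case and surrounding whitespace.
_TIME_RE = re.compile(r"^\s*time\s*(?:\(\s*(\d+)\s*\))?\s*$", re.IGNORECASE)


def parse_time_type(text: str) -> int:
    """Return the precision of a TIME type string, or raise ValueError."""
    match = _TIME_RE.match(text)
    if match is None:
        raise ValueError(f"not a TIME type: {text!r}")
    if match.group(1) is None:
        return DEFAULT_PRECISION
    precision = int(match.group(1))
    if precision > MAX_PRECISION:
        raise ValueError(f"UNSUPPORTED_TIME_PRECISION: {precision}")
    return precision


if __name__ == "__main__":
    for p in range(MAX_PRECISION + 1):
        assert parse_time_type(f"time({p})") == p  # 0..9 are all valid now
    try:
        parse_time_type("time(10)")
    except ValueError as err:
        print(err)  # the invalid case moves from 8/9 to 10
```

Deriving the invalid case from `MAX_PRECISION + 1` instead of a literal keeps such a test correct if the cap moves again.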
h3. MUST-UPDATE (enumerate 0..6 as "all precisions"; won't error but miss the new range)
* TimeFunctionsSuiteBase.scala (current_time {{(0 to 6)}} loop).
* AvroSuite.scala / AvroFunctionsSuite.scala (precision 0-6 loops; time-micros logical type).
* OrcQuerySuite.scala (TIME(0)..TIME(6) casts + {{0 to 6}} assert loop).
* from_/to_ function suites with testData precisions 0-6: CsvFunctionsSuite, JsonFunctionsSuite,
CsvExpressionsSuite, JsonExpressionsSuite, XmlExpressionsSuite, XmlFunctionsSuite.
h3. LIKELY-UPDATE (pass today; need 7-9 cases / nanosecond expectations)
* Generators capped at micros: LiteralGenerator.scala, RandomDataGenerator.scala,
DateTimeTestUtils.localTime(..., micros).
* TimeFormatterSuite (HH:mm:ss.SSSSSS, 999999, "TIME(6)" error text) and DateTimeUtilsSuite cast
error text.
* CastSuiteBase (TIME(p1)->TIME(p2), TIME->DECIMAL): loops auto-expand via MAX_PRECISION but input
values are micros-only; add 7-9 fractional cases.
* Parquet/ORC/Avro micros assumptions: ParquetIOSuite, TimeTypeParquetOpsSuite (INT64 TIME(MICROS)
-> needs TIME(NANOS) for 7-9), AvroSuite, PartitionedWriteSuite.
* Loops named via MICROS_PRECISION that mean "all valid precisions" -> switch to MAX_PRECISION:
TimeExpressionsSuite, RowJsonSuite, DataTypeTestUtils (timeTypes), SparkConnectPlannerSuite.
* SQL golden: sql-tests/inputs/time.sql + results/time.sql.out nanosecond-input truncation cases
(regenerate via SQLQueryTestSuite). Tracked in SPARK-57563.
h3. INFORMATIONAL (safe if DEFAULT_PRECISION stays 6)
* ~100+ time(6)/TimeType(6) samples, current_time(6) name checks, sql-expression-schema.md, and
PySpark tests referencing time(6). ArrowConvertersSuite is already nanosecond-aware.
Note: this list assumes DEFAULT_PRECISION remains 6 (only MAX_PRECISION moves to 9). Changing the
default would additionally churn the informational set.
> Extend the TIME data type precision to nanoseconds (up to 9)
> ------------------------------------------------------------
>
> Key: SPARK-57551
> URL: https://issues.apache.org/jira/browse/SPARK-57551
> Project: Spark
> Issue Type: Sub-task
> Components: SQL
> Affects Versions: 4.3.0
> Reporter: Max Gekk
> Priority: Major