This is an automated email from the ASF dual-hosted git repository.
MaxGekk pushed a commit to branch branch-4.x
in repository https://gitbox.apache.org/repos/asf/spark.git
The following commit(s) were added to refs/heads/branch-4.x by this push:
new 08572851e374 [SPARK-57032][SQL] Extend timestamp string parsing for
nanosecond fractional precision
08572851e374 is described below
commit 08572851e374880c54c7b99e7664ab508c79d0f1
Author: Maxim Gekk <[email protected]>
AuthorDate: Tue Jun 2 10:16:44 2026 +0200
[SPARK-57032][SQL] Extend timestamp string parsing for nanosecond
fractional precision
### What changes were proposed in this pull request?
This PR extends Spark's existing timestamp string parser to preserve
fractional-second digits beyond microsecond precision, and adds internal parse
entry points (in the non-public `org.apache.spark.sql.catalyst.util` package)
that produce the nanosecond-capable composite representation for
`TIMESTAMP_NTZ(p)` / `TIMESTAMP_LTZ(p)` with `p` in `[7, 9]`.
- `SparkDateTimeUtils.parseTimestampString` now retains fractional digits
7-9 in a new output-only slot `segments(9)` (the sub-microsecond remainder, a
value in `[0, 999]`). `segments(6)` continues to hold microseconds (digits
1-6), so all existing callers are unaffected. Digits beyond the 9th are
dropped. The parsing loop bound is pinned to `9` (the original number of parsed
segments) so the new slot is never written by the loop, keeping acceptance
behavior identical.
- New internal APIs (in the non-public `catalyst.util` package) returning a
normalized `org.apache.spark.unsafe.types.TimestampNanosVal` (`epochMicros` +
`nanosWithinMicro`):
- `stringToTimestampLTZNanos(s, precision, timeZoneId)` and
`stringToTimestampLTZNanosAnsi(...)`
- `stringToTimestampNTZNanos(s, precision, allowTimeZone = true)` and
`stringToTimestampNTZNanosAnsi(...)`
- The microsecond and nanosecond entry points share their parse +
`java.time` construction through two private helpers, `parseTimestampToInstant`
(LTZ family) and `parseTimestampToLocalDateTime` (NTZ family), which return the
intermediate `java.time` value carrying the full fraction (including the
sub-microsecond remainder). Each public method then keeps only its cheap,
type-specific tail inlined: `instantToMicros` / `localDateTimeToMicros` for the
microsecond path, and the shared `in [...]
- The shared helpers signal unparseable input by returning `null` rather than an
`Option`; callers null-check the result and map `null` to `None`. This is deliberate:
`stringToTimestamp` / `stringToTimestampWithoutTimeZone` are cast hot paths
(and the nanos variants are planned to be wired into casts), so the dedup is
designed to add zero allocation - no intermediate `Option`/closure is
materialized and the small helper bodies inline into the callers, leaving the
microsecond path allocation-identi [...]
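The `null`-sentinel control flow described in the last bullet can be sketched as follows. This is a minimal illustration, not the Spark code: `NullSentinelSketch`, `parseToLocalDate`, and `stringToEpochDay` are hypothetical names, and the input format is simplified to `yyyy-mm-dd`. It shows only the shape of the pattern (private helper returns `null`, public method null-checks and maps exceptions to `None`); it does not measure allocation.

```scala
import java.time.LocalDate
import scala.util.control.NonFatal

object NullSentinelSketch {
  // Hypothetical private helper: returns null (not Option) when the input
  // does not have the expected shape, mirroring the contract described above.
  private def parseToLocalDate(s: String): LocalDate = {
    val parts = s.split('-')
    if (parts.length != 3) {
      null
    } else {
      // May throw on non-numeric parts or out-of-range month/day values.
      LocalDate.of(parts(0).toInt, parts(1).toInt, parts(2).toInt)
    }
  }

  // Hypothetical public entry point: null-checks the helper's result and
  // maps both the null sentinel and any non-fatal exception to None.
  def stringToEpochDay(s: String): Option[Long] = {
    try {
      val date = parseToLocalDate(s)
      if (date == null) None else Some(date.toEpochDay)
    } catch {
      case NonFatal(_) => None
    }
  }
}
```

The only `Option` created is the one returned to the caller; the helper itself hands back either a value or `null`.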
The normalization invariant (`nanosWithinMicro` in `[0, 999]`) holds by
construction: the remainder is built from at most three sub-microsecond digits
(right-padded with zeros to three), and `epochMicros` comes from the independent
microsecond path, so no carry is needed; `TimestampNanosVal.fromParts`
re-validates the range.
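The split between the microsecond part and the sub-microsecond remainder can be sketched on a plain digit string. This is a simplified illustration under stated assumptions, not the parser itself: `FractionSplit.split` is a hypothetical helper that works on a `String` of fractional digits, whereas the real parser works byte-by-byte on a `UTF8String`.

```scala
object FractionSplit {
  /**
   * Hypothetical helper (not part of Spark): splits the fractional-second
   * digits of a timestamp string into (micros, nanosWithinMicro).
   * Digits 1-6 form the microsecond part, digits 7-9 the sub-microsecond
   * remainder, and digits beyond the 9th are dropped.
   */
  def split(fraction: String): (Int, Int) = {
    require(fraction.forall(_.isDigit), "fraction must contain only digits")
    // Keep at most 9 digits, then right-pad with zeros to exactly 9 so that
    // a short fraction such as "1" means 100000 microseconds.
    val nineDigits = fraction.take(9).padTo(9, '0')
    val micros = nineDigits.substring(0, 6).toInt
    val nanosWithinMicro = nineDigits.substring(6, 9).toInt
    (micros, nanosWithinMicro)
  }
}
```

Because the remainder is always read from exactly three (possibly padded) digits, it cannot exceed 999, which is why no carry into the microsecond part is needed.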
### Why are the changes needed?
The logical types `TimestampNTZNanosType` / `TimestampLTZNanosType`, the
physical value `TimestampNanosVal`, and the `TIMESTAMP_NTZ(p)` /
`TIMESTAMP_LTZ(p)` SQL syntax already exist, but string inputs with 7-9
fractional digits could not be converted to the SPIP composite representation
because the parser truncated the fractional part to microseconds. This change
provides the missing string-to-nanos parsing building block that downstream
work (cast matrix, typed SQL literals, ingest t [...]
### Does this PR introduce _any_ user-facing change?
No. Existing `TimestampType` / `TimestampNTZType` string parsing is
byte-for-byte unchanged, and the new parse APIs are internal (`catalyst.util`,
not public API) and not yet wired to user-facing casts or literals.
### How was this patch tested?
Added `TimestampNanosParseSuite` (in `sql/catalyst`) covering:
- 7/8/9-digit fractions preserved as `nanosWithinMicro`;
- per-precision truncation (e.g. `.123456789` -> `700` at p=7, `780` at
p=8, `789` at p=9), and digits beyond the 9th dropped;
- edge cases: `.0`, `.999999999`, trailing zeros, exactly 6 digits,
`.000000001`;
- NTZ vs LTZ: explicit zone offset, region-based zone, session-zone
fallback, and `allowTimeZone` / time-only rejection for NTZ;
- range corpus: Unix epoch, 1582 Julian/Gregorian cutover, year 9999, with
sub-micro fractions;
- a regression assertion pinning the unchanged microsecond results of
`stringToTimestamp` / `stringToTimestampWithoutTimeZone` through the edited
shared parser;
- ANSI variants throwing on invalid input.
Verified existing suites still pass unchanged: `DateTimeUtilsSuite`
(including the SPARK-57033 nanos roundtrip/truncation tests),
`TimestampFormatterSuite`, and the cast paths via `CastWithAnsiOnSuite`,
`CastWithAnsiOffSuite`, and `TryCastSuite`. `./dev/scalastyle` is clean.
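The per-precision truncation exercised by the suite (`789` becoming `700` at p=7 and `780` at p=8) can be sketched as integer arithmetic on the remainder. This is an assumption-labeled illustration: `PrecisionTruncation.truncate` is a hypothetical helper, and the bodies of `instantToTimestampNanos` / `localDateTimeToTimestampNanos`, which do the real work, are not part of this diff.

```scala
object PrecisionTruncation {
  /**
   * Hypothetical helper (not part of Spark): truncates a sub-microsecond
   * remainder in [0, 999] toward zero so that only `precision` fractional
   * digits of the full second survive, for precision in [7, 9].
   */
  def truncate(nanosWithinMicro: Int, precision: Int): Int = {
    require(precision >= 7 && precision <= 9, s"precision $precision is out of range [7, 9]")
    require(nanosWithinMicro >= 0 && nanosWithinMicro <= 999, "remainder must be in [0, 999]")
    // Size of one step on the target grid, in nanoseconds:
    // p=7 -> 100 ns, p=8 -> 10 ns, p=9 -> 1 ns.
    val unit = precision match {
      case 7 => 100
      case 8 => 10
      case _ => 1
    }
    // Integer division drops the digits below the grid; multiplying back
    // restores the magnitude.
    nanosWithinMicro / unit * unit
  }
}
```

The result stays in `[0, 999]` for every valid input, so truncation never disturbs the microsecond part.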
### Was this patch authored or co-authored using generative AI tooling?
Generated-by: Cursor (Claude Opus 4.8)
Closes #56205 from MaxGekk/nanos-parse-string.
Authored-by: Maxim Gekk <[email protected]>
Signed-off-by: Max Gekk <[email protected]>
(cherry picked from commit 1b6097051920503813c26c1b1968b8756ec5a9c3)
Signed-off-by: Max Gekk <[email protected]>
---
.../sql/catalyst/util/SparkDateTimeUtils.scala | 242 +++++++++++++++++---
.../catalyst/util/TimestampNanosParseSuite.scala | 248 +++++++++++++++++++++
2 files changed, 453 insertions(+), 37 deletions(-)
diff --git
a/sql/api/src/main/scala/org/apache/spark/sql/catalyst/util/SparkDateTimeUtils.scala
b/sql/api/src/main/scala/org/apache/spark/sql/catalyst/util/SparkDateTimeUtils.scala
index 597a96c548ce..d7200715f937 100644
---
a/sql/api/src/main/scala/org/apache/spark/sql/catalyst/util/SparkDateTimeUtils.scala
+++
b/sql/api/src/main/scala/org/apache/spark/sql/catalyst/util/SparkDateTimeUtils.scala
@@ -26,11 +26,11 @@ import java.util.regex.Pattern
import scala.util.control.NonFatal
-import org.apache.spark.QueryContext
+import org.apache.spark.{QueryContext, SparkException}
import org.apache.spark.sql.catalyst.util.DateTimeConstants._
import
org.apache.spark.sql.catalyst.util.RebaseDateTime.{rebaseGregorianToJulianDays,
rebaseGregorianToJulianMicros, rebaseJulianToGregorianDays,
rebaseJulianToGregorianMicros}
import org.apache.spark.sql.errors.ExecutionErrors
-import org.apache.spark.sql.types.{DateType, TimestampType, TimeType}
+import org.apache.spark.sql.types.{DateType, TimestampLTZNanosType,
TimestampNTZNanosType, TimestampType, TimeType}
import org.apache.spark.unsafe.types.{TimestampNanosVal, UTF8String}
import org.apache.spark.util.SparkClassUtils
@@ -550,10 +550,10 @@ trait SparkDateTimeUtils {
* order to distinguish between 0L and null. The following formats are
allowed:
*
* `[+-]yyyy*` `[+-]yyyy*-[m]m` `[+-]yyyy*-[m]m-[d]d` `[+-]yyyy*-[m]m-[d]d `
- * `[+-]yyyy*-[m]m-[d]d [h]h:[m]m:[s]s.[ms][ms][ms][us][us][us][zone_id]`
- * `[+-]yyyy*-[m]m-[d]dT[h]h:[m]m:[s]s.[ms][ms][ms][us][us][us][zone_id]`
- * `[h]h:[m]m:[s]s.[ms][ms][ms][us][us][us][zone_id]`
- * `T[h]h:[m]m:[s]s.[ms][ms][ms][us][us][us][zone_id]`
+ * `[+-]yyyy*-[m]m-[d]d
[h]h:[m]m:[s]s.[ms][ms][ms][us][us][us][ns][ns][ns][zone_id]`
+ *
`[+-]yyyy*-[m]m-[d]dT[h]h:[m]m:[s]s.[ms][ms][ms][us][us][us][ns][ns][ns][zone_id]`
+ * `[h]h:[m]m:[s]s.[ms][ms][ms][us][us][us][ns][ns][ns][zone_id]`
+ * `T[h]h:[m]m:[s]s.[ms][ms][ms][us][us][us][ns][ns][ns][zone_id]`
*
* where `zone_id` should have one of the forms:
* - Z - Zulu time zone UTC+0
@@ -567,6 +567,11 @@ trait SparkDateTimeUtils {
* - +|-hhmmss
* - Region-based zone IDs in the form `area/city`, such as `Europe/Paris`
*
+ * Up to 9 fractional-second digits are accepted. Digits 1-6 are kept as
microseconds in
+ * `segments(6)` (backward-compatible micro behavior), digits 7-9 are kept
as the
+ * sub-microsecond remainder in `segments(9)` (a value in [0, 999]), and
digits beyond the 9th
+ * are dropped.
+ *
* @return
* timestamp segments, time zone id and whether the input is just time
without a date. If the
* input string can't be parsed as timestamp, the result timestamp
segments are empty.
@@ -575,7 +580,8 @@ trait SparkDateTimeUtils {
def isValidDigits(segment: Int, digits: Int): Boolean = {
// A Long is able to represent a timestamp within [+-]200 thousand years
val maxDigitsYear = 6
- // For the nanosecond part, more than 6 digits is allowed, but will be
truncated.
+ // Fractional digits 1-6 form microseconds; digits 7-9 are retained as
the sub-microsecond
+ // remainder in segments(9); only digits beyond the 9th are dropped.
segment == 6 || (segment == 0 && digits >= 4 && digits <= maxDigitsYear)
||
// For the zoneId segment(7), it's could be zero digits when it's a
region-based zone ID
(segment == 7 && digits <= 2) ||
@@ -585,7 +591,12 @@ trait SparkDateTimeUtils {
return (Array.empty, None, false)
}
var tz: Option[String] = None
- val segments: Array[Int] = Array[Int](1, 1, 1, 0, 0, 0, 0, 0, 0)
+ // Indices 0-6 hold year, month, day, hour, minute, second and the
microsecond part of the
+ // fractional second (digits 1-6). Index 9 is an output-only slot that
holds the
+ // sub-microsecond remainder (fractional digits 7-9) as a value in [0,
999]; it is never
+ // written by the parsing loop below. Indices 7-8 are written by the loop
as `i` advances
+ // but their values are never read by any caller.
+ val segments: Array[Int] = Array[Int](1, 1, 1, 0, 0, 0, 0, 0, 0, 0)
var i = 0
var currentSegmentValue = 0
var currentSegmentDigits = 0
@@ -598,6 +609,7 @@ trait SparkDateTimeUtils {
}
var digitsMilli = 0
+ var nanosWithinMicro = 0
var justTime = false
var yearSign: Option[Int] = None
if (bytes(j) == '-' || bytes(j) == '+') {
@@ -680,7 +692,9 @@ trait SparkDateTimeUtils {
i += 1
}
} else {
- if (i < segments.length && (b == ':' || b == ' ')) {
+ // Bound is fixed at 9 (the original number of parsed segments) so
that the trailing
+ // output-only slot segments(9) is never written by the parsing loop.
+ if (i < 9 && (b == ':' || b == ' ')) {
if (!isValidDigits(i, currentSegmentDigits)) {
return (Array.empty, None, false)
}
@@ -696,10 +710,13 @@ trait SparkDateTimeUtils {
if (i == 6) {
digitsMilli += 1
}
- // We will truncate the nanosecond part if there are more than 6
digits, which results
- // in loss of precision
if (i != 6 || currentSegmentDigits < 6) {
+ // Fractional digits 1-6 form the microsecond part stored in
segments(6).
currentSegmentValue = currentSegmentValue * 10 + parsedValue
+ } else if (currentSegmentDigits < 9) {
+ // Fractional digits 7-9 are retained as the sub-microsecond
remainder. Digits beyond
+ // the 9th are dropped (loss of precision below the nanosecond grid).
+ nanosWithinMicro = nanosWithinMicro * 10 + parsedValue
}
currentSegmentDigits += 1
}
@@ -716,12 +733,57 @@ trait SparkDateTimeUtils {
digitsMilli += 1
}
+ // Right-pad the captured sub-microsecond digits (the 7th to 9th
fractional digits) so that
+ // segments(9) always holds a value in [0, 999]. The number of captured
digits is
+ // clamp(digitsMilli - 6, 0, 3); fewer captured digits means the remainder
is left-aligned and
+ // must be scaled up (e.g. ".0000001" -> 100, ".00000012" -> 120,
".000000123" -> 123).
+ var subMicroDigits = math.max(0, math.min(digitsMilli, 9) - 6)
+ while (subMicroDigits < 3) {
+ nanosWithinMicro *= 10
+ subMicroDigits += 1
+ }
+ segments(9) = nanosWithinMicro
+
// This step also validates time zone part
val zoneId = tz.map(zoneName => getZoneId(zoneName.trim))
segments(0) *= yearSign.getOrElse(1)
(segments, zoneId, justTime)
}
+ /**
+ * Parses a UTF8 timestamp string into the [[Instant]] it denotes, shared by
the LTZ entry
+ * points `stringToTimestamp` (micros) and `stringToTimestampLTZNanos`
(nanos). The full
+ * fractional part (including sub-microsecond digits) is carried in the
[[Instant]]; each caller
+ * then narrows to its own precision (`instantToMicros` floors the sub-micro
digits,
+ * `instantToTimestampNanos` truncates to the requested precision), so this
helper is
+ * behavior-preserving for the micro path. Callers are expected to wrap the
call in a
+ * `try`/`catch` that maps `NonFatal` to `None`.
+ *
+ * Returns `null` (rather than [[Option]]) when the string is unparseable.
The `null` sentinel
+ * keeps these cast hot paths allocation-free: no intermediate
`Option`/closure is materialized,
+ * and the small body inlines into the caller. Callers must null-check the
result.
+ */
+ private def parseTimestampToInstant(s: UTF8String, timeZoneId: ZoneId):
Instant = {
+ val (segments, parsedZoneId, justTime) = parseTimestampString(s)
+ if (segments.isEmpty) {
+ return null
+ }
+ val zoneId = parsedZoneId.getOrElse(timeZoneId)
+ // Combine the microsecond part (digits 1-6) and the sub-microsecond
remainder (digits 7-9)
+ // into a full nano-of-second. This is harmless for the micro path because
`instantToMicros`
+ // floors the sub-microsecond digits away.
+ val nanoOfSecond = (MICROSECONDS.toNanos(segments(6)) + segments(9)).toInt
+ val localTime = LocalTime.of(segments(3), segments(4), segments(5),
nanoOfSecond)
+ val localDate = if (justTime) {
+ LocalDate.now(zoneId)
+ } else {
+ LocalDate.of(segments(0), segments(1), segments(2))
+ }
+ val localDateTime = LocalDateTime.of(localDate, localTime)
+ val zonedDateTime = ZonedDateTime.of(localDateTime, zoneId)
+ Instant.from(zonedDateTime)
+ }
+
/**
* Trims and parses a given UTF8 timestamp string to the corresponding a
corresponding [[Long]]
* value. The return type is [[Option]] in order to distinguish between 0L
and null. Please
@@ -729,22 +791,9 @@ trait SparkDateTimeUtils {
*/
def stringToTimestamp(s: UTF8String, timeZoneId: ZoneId): Option[Long] = {
try {
- val (segments, parsedZoneId, justTime) = parseTimestampString(s)
- if (segments.isEmpty) {
- return None
- }
- val zoneId = parsedZoneId.getOrElse(timeZoneId)
- val nanoseconds = MICROSECONDS.toNanos(segments(6))
- val localTime = LocalTime.of(segments(3), segments(4), segments(5),
nanoseconds.toInt)
- val localDate = if (justTime) {
- LocalDate.now(zoneId)
- } else {
- LocalDate.of(segments(0), segments(1), segments(2))
- }
- val localDateTime = LocalDateTime.of(localDate, localTime)
- val zonedDateTime = ZonedDateTime.of(localDateTime, zoneId)
- val instant = Instant.from(zonedDateTime)
- Some(instantToMicros(instant))
+ // `null` here means the string was unparseable (see
`parseTimestampToInstant`).
+ val instant = parseTimestampToInstant(s, timeZoneId)
+ if (instant == null) None else Some(instantToMicros(instant))
} catch {
case NonFatal(_) => None
}
@@ -771,24 +820,143 @@ trait SparkDateTimeUtils {
* The return type is [[Option]] in order to distinguish between 0L and
null. Please refer to
* `parseTimestampString` for the allowed formats.
*/
+ /**
+ * Parses a UTF8 timestamp string into the zone-independent
[[LocalDateTime]] it denotes, shared
+ * by the NTZ entry points `stringToTimestampWithoutTimeZone` (micros) and
+ * `stringToTimestampNTZNanos` (nanos). A time zone component is discarded
when `allowTimeZone`
+ * is `true` and rejected otherwise. The full fractional part (including
sub-microsecond digits)
+ * is carried in the [[LocalDateTime]]; each caller then narrows to its own
precision
+ * (`localDateTimeToMicros` floors the sub-micro digits,
`localDateTimeToTimestampNanos`
+ * truncates to the requested precision), so this helper is
behavior-preserving for the micro
+ * path. Callers are expected to wrap the call in a `try`/`catch` that maps
`NonFatal` to
+ * `None`.
+ *
+ * Returns `null` (rather than [[Option]]) when the string is unparseable,
contains only a time
+ * part, or carries a time zone while `allowTimeZone` is `false`. The `null`
sentinel keeps
+ * these cast hot paths allocation-free: no intermediate `Option`/closure is
materialized, and
+ * the small body inlines into the caller. Callers must null-check the
result.
+ */
+ private def parseTimestampToLocalDateTime(
+ s: UTF8String,
+ allowTimeZone: Boolean): LocalDateTime = {
+ val (segments, zoneIdOpt, justTime) = parseTimestampString(s)
+ // If the input string can't be parsed as a timestamp without time zone,
or it contains only
+ // the time part of a timestamp and we can't determine its date, signal
failure with `null`.
+ if (segments.isEmpty || justTime || !allowTimeZone && zoneIdOpt.isDefined)
{
+ return null
+ }
+ // Combine the microsecond part (digits 1-6) and the sub-microsecond
remainder (digits 7-9)
+ // into a full nano-of-second. This is harmless for the micro path because
+ // `localDateTimeToMicros` floors the sub-microsecond digits away.
+ val nanoOfSecond = (MICROSECONDS.toNanos(segments(6)) + segments(9)).toInt
+ val localTime = LocalTime.of(segments(3), segments(4), segments(5),
nanoOfSecond)
+ val localDate = LocalDate.of(segments(0), segments(1), segments(2))
+ LocalDateTime.of(localDate, localTime)
+ }
+
def stringToTimestampWithoutTimeZone(s: UTF8String, allowTimeZone: Boolean):
Option[Long] = {
try {
- val (segments, zoneIdOpt, justTime) = parseTimestampString(s)
- // If the input string can't be parsed as a timestamp without time zone,
or it contains only
- // the time part of a timestamp and we can't determine its date, return
None.
- if (segments.isEmpty || justTime || !allowTimeZone &&
zoneIdOpt.isDefined) {
- return None
+ // `null` here means the string was unparseable (see
`parseTimestampToLocalDateTime`).
+ val localDateTime = parseTimestampToLocalDateTime(s, allowTimeZone)
+ if (localDateTime == null) None else
Some(localDateTimeToMicros(localDateTime))
+ } catch {
+ case NonFatal(_) => None
+ }
+ }
+
+ /**
+ * Trims and parses a given UTF8 string into a [[TimestampNanosVal]] (epoch
microseconds plus a
+ * sub-microsecond remainder in [0, 999]) for `TIMESTAMP_LTZ(precision)`
with `precision` in [7,
+ * 9]. Fractional digits beyond `precision` are truncated. The return type
is [[Option]] in
+ * order to distinguish between a valid zero value and null. Please refer to
+ * `parseTimestampString` for the allowed formats.
+ */
+ def stringToTimestampLTZNanos(
+ s: UTF8String,
+ precision: Int,
+ timeZoneId: ZoneId): Option[TimestampNanosVal] = {
+ if (precision < 7 || precision > 9) {
+ throw SparkException.internalError(
+ s"stringToTimestampLTZNanos: precision $precision is out of range [7,
9]")
+ }
+ try {
+ // `null` here means the string was unparseable (see
`parseTimestampToInstant`). The shared
+ // helper carries the full fraction in the `Instant`;
`instantToTimestampNanos` then splits
+ // it into (epochMicros, nanosWithinMicro) and applies the `precision`
truncation.
+ val instant = parseTimestampToInstant(s, timeZoneId)
+ if (instant == null) None else Some(instantToTimestampNanos(instant,
precision))
+ } catch {
+ case NonFatal(_) => None
+ }
+ }
+
+ def stringToTimestampLTZNanosAnsi(
+ s: UTF8String,
+ precision: Int,
+ timeZoneId: ZoneId,
+ context: QueryContext = null): TimestampNanosVal = {
+ stringToTimestampLTZNanos(s, precision, timeZoneId).getOrElse {
+ throw ExecutionErrors.invalidInputInCastToDatetimeError(
+ s,
+ TimestampLTZNanosType(precision),
+ context)
+ }
+ }
+
+ /**
+ * Trims and parses a given UTF8 string into a [[TimestampNanosVal]] (epoch
microseconds plus a
+ * sub-microsecond remainder in [0, 999]) for `TIMESTAMP_NTZ(precision)`
with `precision` in [7,
+ * 9]. Fractional digits beyond `precision` are truncated. The result is
independent of time
+ * zones; a time zone component is discarded when `allowTimeZone` is `true`
and rejected
+ * (returns `None`) otherwise. The return type is [[Option]] in order to
distinguish between a
+ * valid zero value and null. Please refer to `parseTimestampString` for the
allowed formats.
+ */
+ def stringToTimestampNTZNanos(
+ s: UTF8String,
+ precision: Int,
+ allowTimeZone: Boolean = true): Option[TimestampNanosVal] = {
+ if (precision < 7 || precision > 9) {
+ throw SparkException.internalError(
+ s"stringToTimestampNTZNanos: precision $precision is out of range [7,
9]")
+ }
+ try {
+ // `null` here means the string was unparseable (see
`parseTimestampToLocalDateTime`). The
+ // shared helper carries the full fraction in the `LocalDateTime`;
+ // `localDateTimeToTimestampNanos` then splits it into (epochMicros,
nanosWithinMicro) and
+ // applies the `precision` truncation.
+ val localDateTime = parseTimestampToLocalDateTime(s, allowTimeZone)
+ if (localDateTime == null) {
+ None
+ } else {
+ Some(localDateTimeToTimestampNanos(localDateTime, precision))
}
- val nanoseconds = MICROSECONDS.toNanos(segments(6))
- val localTime = LocalTime.of(segments(3), segments(4), segments(5),
nanoseconds.toInt)
- val localDate = LocalDate.of(segments(0), segments(1), segments(2))
- val localDateTime = LocalDateTime.of(localDate, localTime)
- Some(localDateTimeToMicros(localDateTime))
} catch {
case NonFatal(_) => None
}
}
+ /**
+ * ANSI variant of [[stringToTimestampNTZNanos]]. Throws
+ * [[org.apache.spark.SparkDateTimeException]] on invalid input. Uses
`allowTimeZone = true`: a
+ * time zone component in the string is silently discarded rather than
rejected. Callers that
+ * need strict NTZ rejection should call [[stringToTimestampNTZNanos]]
directly with
+ * `allowTimeZone = false`.
+ */
+ def stringToTimestampNTZNanosAnsi(
+ s: UTF8String,
+ precision: Int,
+ context: QueryContext = null): TimestampNanosVal = {
+ // TODO(SPARK-57032): when this is wired to a user-facing CAST(... AS
TIMESTAMP_NTZ(p)), the
+ // cast must decide `allowTimeZone` explicitly (per ANSI/legacy mode)
instead of relying on
+ // the `true` default used here, which silently discards a zone suffix.
+ stringToTimestampNTZNanos(s, precision).getOrElse {
+ throw ExecutionErrors.invalidInputInCastToDatetimeError(
+ s,
+ TimestampNTZNanosType(precision),
+ context)
+ }
+ }
+
/**
* Trims and parses a given UTF8 string to a corresponding [[Long]] value
which representing the
* number of microseconds since the midnight. The result will be independent
of time zones.
diff --git
a/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/util/TimestampNanosParseSuite.scala
b/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/util/TimestampNanosParseSuite.scala
new file mode 100644
index 000000000000..3a4d758da892
--- /dev/null
+++
b/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/util/TimestampNanosParseSuite.scala
@@ -0,0 +1,248 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.util
+
+import java.time.{ZoneId, ZoneOffset}
+
+import org.apache.spark.{SparkDateTimeException, SparkException, SparkFunSuite}
+import org.apache.spark.sql.catalyst.util.DateTimeTestUtils._
+import org.apache.spark.sql.catalyst.util.DateTimeUtils._
+import org.apache.spark.unsafe.types.{TimestampNanosVal, UTF8String}
+
+/**
+ * Tests for string-to-nanosecond timestamp parsing added under SPARK-57032.
The parser keeps the
+ * microsecond part (fractional digits 1-6) and the sub-microsecond remainder
(digits 7-9, in
+ * [0, 999]) and applies the target fractional precision `p` in [7, 9] by
truncating extra digits.
+ */
+class TimestampNanosParseSuite extends SparkFunSuite {
+
+ private val losAngeles = getZoneId("America/Los_Angeles")
+
+ private def ntz(
+ str: String,
+ precision: Int,
+ allowTimeZone: Boolean = true): Option[TimestampNanosVal] = {
+ stringToTimestampNTZNanos(UTF8String.fromString(str), precision,
allowTimeZone)
+ }
+
+ private def ltz(str: String, precision: Int, zoneId: ZoneId):
Option[TimestampNanosVal] = {
+ stringToTimestampLTZNanos(UTF8String.fromString(str), precision, zoneId)
+ }
+
+ test("NTZ: fractional digits 7-9 are preserved as nanosWithinMicro") {
+ assert(ntz("2015-01-02 00:00:00.123456789", 9).get ===
+ TimestampNanosVal.fromParts(date(2015, 1, 2, 0, 0, 0, 123456,
ZoneOffset.UTC), 789.toShort))
+ assert(ntz("2015-01-02 00:00:00.1234567", 9).get ===
+ TimestampNanosVal.fromParts(date(2015, 1, 2, 0, 0, 0, 123456,
ZoneOffset.UTC), 700.toShort))
+ assert(ntz("2015-01-02 00:00:00.12345678", 9).get ===
+ TimestampNanosVal.fromParts(date(2015, 1, 2, 0, 0, 0, 123456,
ZoneOffset.UTC), 780.toShort))
+ }
+
+ test("NTZ: precision truncates excess sub-microsecond digits toward zero") {
+ val micros = date(2020, 12, 31, 23, 59, 59, 123456, ZoneOffset.UTC)
+ assert(ntz("2020-12-31 23:59:59.123456789", 9).get ===
+ TimestampNanosVal.fromParts(micros, 789.toShort))
+ assert(ntz("2020-12-31 23:59:59.123456789", 8).get ===
+ TimestampNanosVal.fromParts(micros, 780.toShort))
+ assert(ntz("2020-12-31 23:59:59.123456789", 7).get ===
+ TimestampNanosVal.fromParts(micros, 700.toShort))
+ }
+
+ test("NTZ: digits beyond the 9th are dropped") {
+ val expected = TimestampNanosVal.fromParts(
+ date(2020, 12, 31, 23, 59, 59, 123456, ZoneOffset.UTC), 789.toShort)
+ assert(ntz("2020-12-31 23:59:59.1234567890", 9).get === expected)
+ assert(ntz("2020-12-31 23:59:59.123456789999", 9).get === expected)
+ }
+
+ test("NTZ: fewer than 6 fractional digits yield zero nanosWithinMicro") {
+ assert(ntz("2020-01-01 00:00:00.0", 9).get ===
+ TimestampNanosVal.fromParts(date(2020, 1, 1, 0, 0, 0, 0,
ZoneOffset.UTC), 0.toShort))
+ assert(ntz("2020-01-01 00:00:00.1", 9).get ===
+ TimestampNanosVal.fromParts(date(2020, 1, 1, 0, 0, 0, 100000,
ZoneOffset.UTC), 0.toShort))
+ assert(ntz("2020-01-01 00:00:00.123456", 9).get ===
+ TimestampNanosVal.fromParts(date(2020, 1, 1, 0, 0, 0, 123456,
ZoneOffset.UTC), 0.toShort))
+ }
+
+ test("NTZ: trailing zeros in the sub-microsecond part") {
+ assert(ntz("2015-01-02 00:00:00.000050000", 9).get ===
+ TimestampNanosVal.fromParts(date(2015, 1, 2, 0, 0, 0, 50,
ZoneOffset.UTC), 0.toShort))
+ assert(ntz("2015-01-02 00:00:00.100000009", 9).get ===
+ TimestampNanosVal.fromParts(date(2015, 1, 2, 0, 0, 0, 100000,
ZoneOffset.UTC), 9.toShort))
+ }
+
+ test("NTZ: maximum and minimum sub-microsecond fractions") {
+ assert(ntz("2020-06-15 12:00:00.999999999", 9).get ===
+ TimestampNanosVal.fromParts(date(2020, 6, 15, 12, 0, 0, 999999,
ZoneOffset.UTC), 999.toShort))
+ assert(ntz("2020-06-15 12:00:00.000000001", 9).get ===
+ TimestampNanosVal.fromParts(date(2020, 6, 15, 12, 0, 0, 0,
ZoneOffset.UTC), 1.toShort))
+ // ".000000001" loses its only sub-micro digit at precision 8 and 7.
+ assert(ntz("2020-06-15 12:00:00.000000001", 8).get.nanosWithinMicro ===
0.toShort)
+ assert(ntz("2020-06-15 12:00:00.000000001", 7).get.nanosWithinMicro ===
0.toShort)
+ }
+
+ test("NTZ: time zone component is discarded or rejected based on
allowTimeZone") {
+ // With allowTimeZone = true (default) the zone suffix is discarded.
+ assert(ntz("2015-03-18T12:03:17.123456789Z", 9).get ===
+ TimestampNanosVal.fromParts(
+ date(2015, 3, 18, 12, 3, 17, 123456, ZoneOffset.UTC), 789.toShort))
+ // With allowTimeZone = false a zone suffix makes the input invalid.
+ assert(ntz("2015-03-18T12:03:17.123456789Z", 9, allowTimeZone =
false).isEmpty)
+ // A time-only input cannot be parsed as TIMESTAMP_NTZ.
+ assert(ntz("12:03:17.123456789", 9).isEmpty)
+ }
+
+ test("LTZ: explicit zone offset in the string") {
+ val expected = TimestampNanosVal.fromParts(
+ date(2015, 3, 18, 12, 3, 17, 123456, getZoneId("+07:00")), 789.toShort)
+ assert(ltz("2015-03-18T12:03:17.123456789+07:00", 9, ZoneOffset.UTC).get
=== expected)
+ }
+
+ test("LTZ: region-based zone in the string") {
+ val expected = TimestampNanosVal.fromParts(
+ date(2015, 3, 18, 12, 3, 17, 123456, getZoneId("Europe/Moscow")),
789.toShort)
+ assert(ltz("2015-03-18T12:03:17.123456789 Europe/Moscow", 9,
ZoneOffset.UTC).get === expected)
+ }
+
+ test("LTZ: falls back to the session zone when the string has no zone") {
+ val expected = TimestampNanosVal.fromParts(
+ date(2015, 3, 18, 12, 3, 17, 123456, losAngeles), 789.toShort)
+ assert(ltz("2015-03-18 12:03:17.123456789", 9, losAngeles).get ===
expected)
+ }
+
+ test("LTZ: precision truncation matches the NTZ path") {
+ val micros = date(2015, 3, 18, 12, 3, 17, 123456, ZoneOffset.UTC)
+ assert(ltz("2015-03-18T12:03:17.123456789Z", 7, ZoneOffset.UTC).get ===
+ TimestampNanosVal.fromParts(micros, 700.toShort))
+ assert(ltz("2015-03-18T12:03:17.123456789Z", 8, ZoneOffset.UTC).get ===
+ TimestampNanosVal.fromParts(micros, 780.toShort))
+ }
+
+ test("range edge cases with sub-microsecond fractions") {
+ // Unix epoch.
+ assert(ntz("1970-01-01 00:00:00.000000001", 9).get ===
+ TimestampNanosVal.fromParts(0L, 1.toShort))
+ // Julian/Gregorian cutover.
+ assert(ntz("1582-10-15 00:00:00.123456789", 9).get ===
+ TimestampNanosVal.fromParts(date(1582, 10, 15, 0, 0, 0, 123456,
ZoneOffset.UTC), 789.toShort))
+ // End of the supported range.
+ assert(ntz("9999-12-31 23:59:59.999999999", 9).get ===
+ TimestampNanosVal.fromParts(
+ date(9999, 12, 31, 23, 59, 59, 999999, ZoneOffset.UTC), 999.toShort))
+ }
+
+ test("null input returns None") {
+ assert(stringToTimestampNTZNanos(null, 9).isEmpty)
+ assert(stringToTimestampLTZNanos(null, 9, ZoneOffset.UTC).isEmpty)
+ }
+
+ test("invalid inputs return None") {
+ assert(ntz("not a timestamp", 9).isEmpty)
+ assert(ntz("", 9).isEmpty)
+ assert(ltz("2015-13-40 99:99:99.123456789", 9, ZoneOffset.UTC).isEmpty)
+ }
+
+ test("LTZ: time-only input uses the session zone's current date") {
+ // Time-only strings are accepted by the LTZ path (date is filled with
LocalDate.now);
+ // they are rejected by the NTZ path because the date is indeterminate.
+ val result = ltz("12:03:17.123456789", 9, ZoneOffset.UTC)
+ assert(result.isDefined)
+ assert(result.get.nanosWithinMicro === 789.toShort)
+ assert(ntz("12:03:17.123456789", 9).isEmpty)
+ }
+
+ test("pre-epoch (negative) timestamps with sub-microsecond fractions") {
+ // Exercises the yearSign path together with segments(9).
+ assert(ntz("-0001-01-01 00:00:00.000000001", 9).get ===
+ TimestampNanosVal.fromParts(
+ date(-1, 1, 1, 0, 0, 0, 0, ZoneOffset.UTC), 1.toShort))
+ assert(ntz("1582-10-14 23:59:59.999999999", 9).get ===
+ TimestampNanosVal.fromParts(
+ date(1582, 10, 14, 23, 59, 59, 999999, ZoneOffset.UTC), 999.toShort))
+ }
+
+ test("micro path through parseTimestampString is unchanged by the nanos
extension") {
+ // Regression guard for the highest-blast-radius change: growing the
segments array and
+ // pinning the parse-loop bound must not alter the microsecond results
returned by the
+ // existing stringToTimestamp / stringToTimestampWithoutTimeZone APIs. On
the micro path the
+ // sub-microsecond digits 7-9 are dropped, exactly as before this change.
+ def micros(str: String): Option[Long] =
+ stringToTimestamp(UTF8String.fromString(str), ZoneOffset.UTC)
+ def microsNtz(str: String): Option[Long] =
+ stringToTimestampWithoutTimeZone(UTF8String.fromString(str),
allowTimeZone = true)
+
+ // 9 fractional digits: still truncated to 6 (micros); digits 7-9 ignored
on the micro path.
+ assert(micros("2015-01-02 00:00:00.123456789") ===
+ Some(date(2015, 1, 2, 0, 0, 0, 123456, ZoneOffset.UTC)))
+ assert(microsNtz("2015-01-02 00:00:00.123456789") ===
+ Some(date(2015, 1, 2, 0, 0, 0, 123456, ZoneOffset.UTC)))
+ // Fewer than 6 fractional digits still right-pad to micros.
+ assert(microsNtz("2015-01-02 00:00:00.1") ===
+ Some(date(2015, 1, 2, 0, 0, 0, 100000, ZoneOffset.UTC)))
+ // Exactly 6 fractional digits are unchanged.
+ assert(microsNtz("2015-01-02 00:00:00.000456") ===
+ Some(date(2015, 1, 2, 0, 0, 0, 456, ZoneOffset.UTC)))
+ // 10+ fractional digits are still accepted and truncated to micros.
+ assert(microsNtz("2015-01-02 00:00:00.1234567890") ===
+ Some(date(2015, 1, 2, 0, 0, 0, 123456, ZoneOffset.UTC)))
+ }
+
+ test("stringToTimestampNTZNanos throws internalError for out-of-range
precision") {
+ // Precision must be in [7, 9]; anything outside is a caller bug and
should surface loudly.
+ Seq(0, 6, 10, -1).foreach { p =>
+ checkError(
+ exception = intercept[SparkException] {
+ stringToTimestampNTZNanos(
+ UTF8String.fromString("2020-01-01 00:00:00.123456789"), p)
+ },
+ condition = "INTERNAL_ERROR",
+ parameters = Map(
+ "message" -> s"stringToTimestampNTZNanos: precision $p is out of
range [7, 9]"))
+ }
+ }
+
+ test("ANSI NTZ: time zone component in the string is silently discarded") {
+ // allowTimeZone defaults to true in the ANSI variant: the zone suffix is
dropped, not
+ // rejected. Callers that need strict rejection must use
stringToTimestampNTZNanos directly
+ // with allowTimeZone = false.
+ val result = stringToTimestampNTZNanosAnsi(
+ UTF8String.fromString("2015-03-18T12:03:17.123456789Z"), 9)
+ assert(result ===
+ TimestampNanosVal.fromParts(
+ date(2015, 3, 18, 12, 3, 17, 123456, ZoneOffset.UTC), 789.toShort))
+ }
+
+ test("ANSI variants throw on invalid input") {
+ val ntzValid = stringToTimestampNTZNanosAnsi(
+ UTF8String.fromString("2015-01-02 00:00:00.123456789"), 9)
+ assert(ntzValid ===
+ TimestampNanosVal.fromParts(date(2015, 1, 2, 0, 0, 0, 123456,
ZoneOffset.UTC), 789.toShort))
+
+ val ltzValid = stringToTimestampLTZNanosAnsi(
+ UTF8String.fromString("2015-01-02 00:00:00.123456789Z"), 9,
ZoneOffset.UTC)
+ assert(ltzValid ===
+ TimestampNanosVal.fromParts(date(2015, 1, 2, 0, 0, 0, 123456,
ZoneOffset.UTC), 789.toShort))
+
+ intercept[SparkDateTimeException] {
+ stringToTimestampNTZNanosAnsi(UTF8String.fromString("invalid"), 9)
+ }
+ intercept[SparkDateTimeException] {
+ stringToTimestampLTZNanosAnsi(UTF8String.fromString("invalid"), 9,
ZoneOffset.UTC)
+ }
+ }
+}