(spark) branch branch-4.x updated: [SPARK-57032][SQL] Extend timestamp string parsing for nanosecond fractional precision

maxgekk Tue, 02 Jun 2026 01:17:30 -0700

This is an automated email from the ASF dual-hosted git repository.

MaxGekk pushed a commit to branch branch-4.x
in repository https://gitbox.apache.org/repos/asf/spark.git



The following commit(s) were added to refs/heads/branch-4.x by this push:
     new 08572851e374 [SPARK-57032][SQL] Extend timestamp string parsing for nanosecond fractional precision
08572851e374 is described below

commit 08572851e374880c54c7b99e7664ab508c79d0f1
Author: Maxim Gekk <[email protected]>
AuthorDate: Tue Jun 2 10:16:44 2026 +0200

    [SPARK-57032][SQL] Extend timestamp string parsing for nanosecond fractional precision
    
    ### What changes were proposed in this pull request?
    
    This PR extends Spark's existing timestamp string parser to preserve fractional-second digits beyond microsecond precision, and adds internal parse entry points (in the non-public `org.apache.spark.sql.catalyst.util` package) that produce the nanosecond-capable composite representation for `TIMESTAMP_NTZ(p)` / `TIMESTAMP_LTZ(p)` with `p` in `[7, 9]`.
    
    - `SparkDateTimeUtils.parseTimestampString` now retains fractional digits 7-9 in a new output-only slot `segments(9)` (the sub-microsecond remainder, a value in `[0, 999]`). `segments(6)` continues to hold microseconds (digits 1-6), so all existing callers are unaffected. Digits beyond the 9th are dropped. The parsing loop bound is pinned to `9` (the original number of parsed segments) so the new slot is never written by the loop, keeping acceptance behavior identical.
    - New internal APIs (in the non-public `catalyst.util` package) returning a normalized `org.apache.spark.unsafe.types.TimestampNanosVal` (`epochMicros` + `nanosWithinMicro`):
      - `stringToTimestampLTZNanos(s, precision, timeZoneId)` and `stringToTimestampLTZNanosAnsi(...)`
      - `stringToTimestampNTZNanos(s, precision, allowTimeZone = true)` and `stringToTimestampNTZNanosAnsi(...)`
    - The microsecond and nanosecond entry points share their parse + `java.time` construction through two private helpers, `parseTimestampToInstant` (LTZ family) and `parseTimestampToLocalDateTime` (NTZ family), which return the intermediate `java.time` value carrying the full fraction (including the sub-microsecond remainder). Each public method then keeps only its cheap, type-specific tail inlined: `instantToMicros` / `localDateTimeToMicros` for the microsecond path, and the shared `in [...]
    - The shared helpers signal an unparseable input by returning `null` (the callers null-check and map to `None`) rather than `Option`. This is deliberate: `stringToTimestamp` / `stringToTimestampWithoutTimeZone` are cast hot paths (and the nanos variants are planned to be wired into casts), so the dedup is designed to add zero allocation - no intermediate `Option`/closure is materialized and the small helper bodies inline into the callers, leaving the microsecond path allocation-identi [...]
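    
    The `null`-sentinel shape described in the last bullet can be sketched as follows. This is a minimal illustration, not the code from this commit: `NullSentinelSketch`, `parseToInstantOrNull` and `toEpochSeconds` are hypothetical names, and `Instant.parse` stands in for Spark's own parser. It shows only the control flow (private helper returns `null`, the public wrapper null-checks and maps `NonFatal` to `None`) and makes no claim about measured allocation.

```scala
import java.time.Instant
import scala.util.control.NonFatal

object NullSentinelSketch {
  // Hypothetical stand-in for a shared private helper: it returns null
  // (not Option) when the input is rejected up front, and may throw when
  // the underlying java.time parser fails.
  private def parseToInstantOrNull(s: String): Instant = {
    if (s == null || s.isEmpty) null
    else Instant.parse(s) // throws DateTimeParseException on malformed input
  }

  // Public wrapper: null-checks the helper result and maps any non-fatal
  // exception to None, so callers only ever see an Option.
  def toEpochSeconds(s: String): Option[Long] = {
    try {
      val instant = parseToInstantOrNull(s)
      if (instant == null) None else Some(instant.getEpochSecond)
    } catch {
      case NonFatal(_) => None
    }
  }
}
```

    In this sketch the only `Option` created is the one returned to the caller; the helper itself never wraps its result.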
    
    The normalization invariant (`nanosWithinMicro` in `[0, 999]`) holds by construction: the remainder is parsed as exactly the 3 sub-microsecond digits and `epochMicros` comes from the independent microsecond path, so no carry is needed; `TimestampNanosVal.fromParts` re-validates the range.
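    
    To make the digit split concrete, here is a small standalone sketch of how a fractional-second digit string maps onto the two parts. It is an illustration under stated assumptions, not the parser from this commit (which accumulates digits one byte at a time): `FractionSplitSketch` and `splitFraction` are hypothetical names, and the input is assumed to be only the digits after the decimal point.

```scala
object FractionSplitSketch {
  // Hypothetical helper: splits the digits after the decimal point into
  // (micros, nanosWithinMicro). Digits 1-6 form the microsecond part,
  // digits 7-9 form the sub-microsecond remainder, digits beyond the 9th
  // are dropped, and shorter inputs are right-padded with zeros.
  def splitFraction(fraction: String): (Int, Int) = {
    require(fraction.forall(c => c >= '0' && c <= '9'), "digits only")
    val nine = fraction.take(9).padTo(9, '0')
    val micros = nine.substring(0, 6).toInt           // always in [0, 999999]
    val nanosWithinMicro = nine.substring(6, 9).toInt // always in [0, 999]
    (micros, nanosWithinMicro)
  }
}
```

    Because each part is read from its own fixed-width slice, neither value can overflow into the other, which is why no carry step is needed.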
    
    ### Why are the changes needed?
    
    The logical types `TimestampNTZNanosType` / `TimestampLTZNanosType`, the physical value `TimestampNanosVal`, and the `TIMESTAMP_NTZ(p)` / `TIMESTAMP_LTZ(p)` SQL syntax already exist, but string inputs with 7-9 fractional digits could not be converted to the SPIP composite representation because the parser truncated the fractional part to microseconds. This change provides the missing string-to-nanos parsing building block that downstream work (cast matrix, typed SQL literals, ingest t [...]
    
    ### Does this PR introduce _any_ user-facing change?
    
    No. Existing `TimestampType` / `TimestampNTZType` string parsing is byte-for-byte unchanged, and the new parse APIs are internal (`catalyst.util`, not public API) and not yet wired to user-facing casts or literals.
    
    ### How was this patch tested?
    
    Added `TimestampNanosParseSuite` (in `sql/catalyst`) covering:
    - 7/8/9-digit fractions preserved as `nanosWithinMicro`;
    - per-precision truncation (e.g. `.123456789` -> `700` at p=7, `780` at p=8, `789` at p=9), and digits beyond the 9th dropped;
    - edge cases: `.0`, `.999999999`, trailing zeros, exactly 6 digits, `.000000001`;
    - NTZ vs LTZ: explicit zone offset, region-based zone, session-zone fallback, and `allowTimeZone` / time-only rejection for NTZ;
    - range corpus: Unix epoch, 1582 Julian/Gregorian cutover, year 9999, with sub-micro fractions;
    - a regression assertion pinning the unchanged microsecond results of `stringToTimestamp` / `stringToTimestampWithoutTimeZone` through the edited shared parser;
    - ANSI variants throwing on invalid input.
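    
    The per-precision truncation listed above is simple arithmetic on the sub-microsecond remainder. The sketch below is a hedged illustration only: `PrecisionTruncationSketch` and `truncateToPrecision` are hypothetical names and do not reproduce `instantToTimestampNanos` / `localDateTimeToTimestampNanos`, whose bodies are not part of this diff.

```scala
object PrecisionTruncationSketch {
  // Hypothetical helper: truncates a sub-microsecond remainder in [0, 999]
  // toward zero so that only the digits allowed by `precision` remain.
  // precision 7 keeps the hundreds digit, 8 keeps hundreds and tens,
  // 9 keeps all three digits.
  def truncateToPrecision(nanosWithinMicro: Int, precision: Int): Int = {
    require(precision >= 7 && precision <= 9, "precision must be in [7, 9]")
    require(nanosWithinMicro >= 0 && nanosWithinMicro <= 999, "remainder must be in [0, 999]")
    val step = precision match {
      case 7 => 100
      case 8 => 10
      case _ => 1
    }
    nanosWithinMicro - nanosWithinMicro % step
  }
}
```

    With a remainder of `789` this yields `700`, `780` and `789` for p=7, 8 and 9, matching the example in the list.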
    
    Verified existing suites still pass unchanged: `DateTimeUtilsSuite` (including the SPARK-57033 nanos roundtrip/truncation tests), `TimestampFormatterSuite`, and the cast paths via `CastWithAnsiOnSuite`, `CastWithAnsiOffSuite`, and `TryCastSuite`. `./dev/scalastyle` is clean.
    
    ### Was this patch authored or co-authored using generative AI tooling?
    
    Generated-by: Cursor (Claude Opus 4.8)
    
    Closes #56205 from MaxGekk/nanos-parse-string.
    
    Authored-by: Maxim Gekk <[email protected]>
    Signed-off-by: Max Gekk <[email protected]>
    (cherry picked from commit 1b6097051920503813c26c1b1968b8756ec5a9c3)
    Signed-off-by: Max Gekk <[email protected]>
---
 .../sql/catalyst/util/SparkDateTimeUtils.scala     | 242 +++++++++++++++++---
 .../catalyst/util/TimestampNanosParseSuite.scala   | 248 +++++++++++++++++++++
 2 files changed, 453 insertions(+), 37 deletions(-)

diff --git a/sql/api/src/main/scala/org/apache/spark/sql/catalyst/util/SparkDateTimeUtils.scala b/sql/api/src/main/scala/org/apache/spark/sql/catalyst/util/SparkDateTimeUtils.scala
index 597a96c548ce..d7200715f937 100644
--- a/sql/api/src/main/scala/org/apache/spark/sql/catalyst/util/SparkDateTimeUtils.scala
+++ b/sql/api/src/main/scala/org/apache/spark/sql/catalyst/util/SparkDateTimeUtils.scala
@@ -26,11 +26,11 @@ import java.util.regex.Pattern
 
 import scala.util.control.NonFatal
 
-import org.apache.spark.QueryContext
+import org.apache.spark.{QueryContext, SparkException}
 import org.apache.spark.sql.catalyst.util.DateTimeConstants._
 import org.apache.spark.sql.catalyst.util.RebaseDateTime.{rebaseGregorianToJulianDays, rebaseGregorianToJulianMicros, rebaseJulianToGregorianDays, rebaseJulianToGregorianMicros}
 import org.apache.spark.sql.errors.ExecutionErrors
-import org.apache.spark.sql.types.{DateType, TimestampType, TimeType}
+import org.apache.spark.sql.types.{DateType, TimestampLTZNanosType, TimestampNTZNanosType, TimestampType, TimeType}
 import org.apache.spark.unsafe.types.{TimestampNanosVal, UTF8String}
 import org.apache.spark.util.SparkClassUtils
 
@@ -550,10 +550,10 @@ trait SparkDateTimeUtils {
    * order to distinguish between 0L and null. The following formats are allowed:
    *
    * `[+-]yyyy*` `[+-]yyyy*-[m]m` `[+-]yyyy*-[m]m-[d]d` `[+-]yyyy*-[m]m-[d]d `
-   * `[+-]yyyy*-[m]m-[d]d [h]h:[m]m:[s]s.[ms][ms][ms][us][us][us][zone_id]`
-   * `[+-]yyyy*-[m]m-[d]dT[h]h:[m]m:[s]s.[ms][ms][ms][us][us][us][zone_id]`
-   * `[h]h:[m]m:[s]s.[ms][ms][ms][us][us][us][zone_id]`
-   * `T[h]h:[m]m:[s]s.[ms][ms][ms][us][us][us][zone_id]`
+   * `[+-]yyyy*-[m]m-[d]d [h]h:[m]m:[s]s.[ms][ms][ms][us][us][us][ns][ns][ns][zone_id]`
+   * `[+-]yyyy*-[m]m-[d]dT[h]h:[m]m:[s]s.[ms][ms][ms][us][us][us][ns][ns][ns][zone_id]`
+   * `[h]h:[m]m:[s]s.[ms][ms][ms][us][us][us][ns][ns][ns][zone_id]`
+   * `T[h]h:[m]m:[s]s.[ms][ms][ms][us][us][us][ns][ns][ns][zone_id]`
    *
    * where `zone_id` should have one of the forms:
    *   - Z - Zulu time zone UTC+0
@@ -567,6 +567,11 @@ trait SparkDateTimeUtils {
    *     - +|-hhmmss
    *   - Region-based zone IDs in the form `area/city`, such as `Europe/Paris`
    *
+   * Up to 9 fractional-second digits are accepted. Digits 1-6 are kept as microseconds in
+   * `segments(6)` (backward-compatible micro behavior), digits 7-9 are kept as the
+   * sub-microsecond remainder in `segments(9)` (a value in [0, 999]), and digits beyond the 9th
+   * are dropped.
+   *
    * @return
    *   timestamp segments, time zone id and whether the input is just time without a date. If the
    *   input string can't be parsed as timestamp, the result timestamp segments are empty.
@@ -575,7 +580,8 @@ trait SparkDateTimeUtils {
     def isValidDigits(segment: Int, digits: Int): Boolean = {
       // A Long is able to represent a timestamp within [+-]200 thousand years
       val maxDigitsYear = 6
-      // For the nanosecond part, more than 6 digits is allowed, but will be truncated.
+      // Fractional digits 1-6 form microseconds; digits 7-9 are retained as the sub-microsecond
+      // remainder in segments(9); only digits beyond the 9th are dropped.
       segment == 6 || (segment == 0 && digits >= 4 && digits <= maxDigitsYear) ||
       // For the zoneId segment(7), it's could be zero digits when it's a region-based zone ID
       (segment == 7 && digits <= 2) ||
@@ -585,7 +591,12 @@ trait SparkDateTimeUtils {
       return (Array.empty, None, false)
     }
     var tz: Option[String] = None
-    val segments: Array[Int] = Array[Int](1, 1, 1, 0, 0, 0, 0, 0, 0)
+    // Indices 0-6 hold year, month, day, hour, minute, second and the microsecond part of the
+    // fractional second (digits 1-6). Index 9 is an output-only slot that holds the
+    // sub-microsecond remainder (fractional digits 7-9) as a value in [0, 999]; it is never
+    // written by the parsing loop below. Indices 7-8 are written by the loop as `i` advances
+    // but their values are never read by any caller.
+    val segments: Array[Int] = Array[Int](1, 1, 1, 0, 0, 0, 0, 0, 0, 0)
     var i = 0
     var currentSegmentValue = 0
     var currentSegmentDigits = 0
@@ -598,6 +609,7 @@ trait SparkDateTimeUtils {
     }
 
     var digitsMilli = 0
+    var nanosWithinMicro = 0
     var justTime = false
     var yearSign: Option[Int] = None
     if (bytes(j) == '-' || bytes(j) == '+') {
@@ -680,7 +692,9 @@ trait SparkDateTimeUtils {
             i += 1
           }
         } else {
-          if (i < segments.length && (b == ':' || b == ' ')) {
+          // Bound is fixed at 9 (the original number of parsed segments) so that the trailing
+          // output-only slot segments(9) is never written by the parsing loop.
+          if (i < 9 && (b == ':' || b == ' ')) {
             if (!isValidDigits(i, currentSegmentDigits)) {
               return (Array.empty, None, false)
             }
@@ -696,10 +710,13 @@ trait SparkDateTimeUtils {
         if (i == 6) {
           digitsMilli += 1
         }
-        // We will truncate the nanosecond part if there are more than 6 digits, which results
-        // in loss of precision
         if (i != 6 || currentSegmentDigits < 6) {
+          // Fractional digits 1-6 form the microsecond part stored in segments(6).
           currentSegmentValue = currentSegmentValue * 10 + parsedValue
+        } else if (currentSegmentDigits < 9) {
+          // Fractional digits 7-9 are retained as the sub-microsecond remainder. Digits beyond
+          // the 9th are dropped (loss of precision below the nanosecond grid).
+          nanosWithinMicro = nanosWithinMicro * 10 + parsedValue
         }
         currentSegmentDigits += 1
       }
@@ -716,12 +733,57 @@ trait SparkDateTimeUtils {
       digitsMilli += 1
     }
 
+    // Right-pad the captured sub-microsecond digits (the 7th to 9th fractional digits) so that
+    // segments(9) always holds a value in [0, 999]. The number of captured digits is
+    // clamp(digitsMilli - 6, 0, 3); fewer captured digits means the remainder is left-aligned and
+    // must be scaled up (e.g. ".0000001" -> 100, ".00000012" -> 120, ".000000123" -> 123).
+    var subMicroDigits = math.max(0, math.min(digitsMilli, 9) - 6)
+    while (subMicroDigits < 3) {
+      nanosWithinMicro *= 10
+      subMicroDigits += 1
+    }
+    segments(9) = nanosWithinMicro
+
     // This step also validates time zone part
     val zoneId = tz.map(zoneName => getZoneId(zoneName.trim))
     segments(0) *= yearSign.getOrElse(1)
     (segments, zoneId, justTime)
   }
 
+  /**
+   * Parses a UTF8 timestamp string into the [[Instant]] it denotes, shared by the LTZ entry
+   * points `stringToTimestamp` (micros) and `stringToTimestampLTZNanos` (nanos). The full
+   * fractional part (including sub-microsecond digits) is carried in the [[Instant]]; each caller
+   * then narrows to its own precision (`instantToMicros` floors the sub-micro digits,
+   * `instantToTimestampNanos` truncates to the requested precision), so this helper is
+   * behavior-preserving for the micro path. Callers are expected to wrap the call in a
+   * `try`/`catch` that maps `NonFatal` to `None`.
+   *
+   * Returns `null` (rather than [[Option]]) when the string is unparseable. The `null` sentinel
+   * keeps these cast hot paths allocation-free: no intermediate `Option`/closure is materialized,
+   * and the small body inlines into the caller. Callers must null-check the result.
+   */
+  private def parseTimestampToInstant(s: UTF8String, timeZoneId: ZoneId): Instant = {
+    val (segments, parsedZoneId, justTime) = parseTimestampString(s)
+    if (segments.isEmpty) {
+      return null
+    }
+    val zoneId = parsedZoneId.getOrElse(timeZoneId)
+    // Combine the microsecond part (digits 1-6) and the sub-microsecond remainder (digits 7-9)
+    // into a full nano-of-second. This is harmless for the micro path because `instantToMicros`
+    // floors the sub-microsecond digits away.
+    val nanoOfSecond = (MICROSECONDS.toNanos(segments(6)) + segments(9)).toInt
+    val localTime = LocalTime.of(segments(3), segments(4), segments(5), nanoOfSecond)
+    val localDate = if (justTime) {
+      LocalDate.now(zoneId)
+    } else {
+      LocalDate.of(segments(0), segments(1), segments(2))
+    }
+    val localDateTime = LocalDateTime.of(localDate, localTime)
+    val zonedDateTime = ZonedDateTime.of(localDateTime, zoneId)
+    Instant.from(zonedDateTime)
+  }
+
   /**
    * Trims and parses a given UTF8 timestamp string to the corresponding a corresponding [[Long]]
    * value. The return type is [[Option]] in order to distinguish between 0L and null. Please
@@ -729,22 +791,9 @@ trait SparkDateTimeUtils {
    */
   def stringToTimestamp(s: UTF8String, timeZoneId: ZoneId): Option[Long] = {
     try {
-      val (segments, parsedZoneId, justTime) = parseTimestampString(s)
-      if (segments.isEmpty) {
-        return None
-      }
-      val zoneId = parsedZoneId.getOrElse(timeZoneId)
-      val nanoseconds = MICROSECONDS.toNanos(segments(6))
-      val localTime = LocalTime.of(segments(3), segments(4), segments(5), nanoseconds.toInt)
-      val localDate = if (justTime) {
-        LocalDate.now(zoneId)
-      } else {
-        LocalDate.of(segments(0), segments(1), segments(2))
-      }
-      val localDateTime = LocalDateTime.of(localDate, localTime)
-      val zonedDateTime = ZonedDateTime.of(localDateTime, zoneId)
-      val instant = Instant.from(zonedDateTime)
-      Some(instantToMicros(instant))
+      // `null` here means the string was unparseable (see `parseTimestampToInstant`).
+      val instant = parseTimestampToInstant(s, timeZoneId)
+      if (instant == null) None else Some(instantToMicros(instant))
     } catch {
       case NonFatal(_) => None
     }
@@ -771,24 +820,143 @@ trait SparkDateTimeUtils {
    * The return type is [[Option]] in order to distinguish between 0L and null. Please refer to
    * `parseTimestampString` for the allowed formats.
    */
+  /**
+   * Parses a UTF8 timestamp string into the zone-independent [[LocalDateTime]] it denotes, shared
+   * by the NTZ entry points `stringToTimestampWithoutTimeZone` (micros) and
+   * `stringToTimestampNTZNanos` (nanos). A time zone component is discarded when `allowTimeZone`
+   * is `true` and rejected otherwise. The full fractional part (including sub-microsecond digits)
+   * is carried in the [[LocalDateTime]]; each caller then narrows to its own precision
+   * (`localDateTimeToMicros` floors the sub-micro digits, `localDateTimeToTimestampNanos`
+   * truncates to the requested precision), so this helper is behavior-preserving for the micro
+   * path. Callers are expected to wrap the call in a `try`/`catch` that maps `NonFatal` to
+   * `None`.
+   *
+   * Returns `null` (rather than [[Option]]) when the string is unparseable, contains only a time
+   * part, or carries a time zone while `allowTimeZone` is `false`. The `null` sentinel keeps
+   * these cast hot paths allocation-free: no intermediate `Option`/closure is materialized, and
+   * the small body inlines into the caller. Callers must null-check the result.
+   */
+  private def parseTimestampToLocalDateTime(
+      s: UTF8String,
+      allowTimeZone: Boolean): LocalDateTime = {
+    val (segments, zoneIdOpt, justTime) = parseTimestampString(s)
+    // If the input string can't be parsed as a timestamp without time zone, or it contains only
+    // the time part of a timestamp and we can't determine its date, signal failure with `null`.
+    if (segments.isEmpty || justTime || !allowTimeZone && zoneIdOpt.isDefined) {
+      return null
+    }
+    // Combine the microsecond part (digits 1-6) and the sub-microsecond remainder (digits 7-9)
+    // into a full nano-of-second. This is harmless for the micro path because
+    // `localDateTimeToMicros` floors the sub-microsecond digits away.
+    val nanoOfSecond = (MICROSECONDS.toNanos(segments(6)) + segments(9)).toInt
+    val localTime = LocalTime.of(segments(3), segments(4), segments(5), nanoOfSecond)
+    val localDate = LocalDate.of(segments(0), segments(1), segments(2))
+    LocalDateTime.of(localDate, localTime)
+  }
+
   def stringToTimestampWithoutTimeZone(s: UTF8String, allowTimeZone: Boolean): Option[Long] = {
     try {
-      val (segments, zoneIdOpt, justTime) = parseTimestampString(s)
-      // If the input string can't be parsed as a timestamp without time zone, or it contains only
-      // the time part of a timestamp and we can't determine its date, return None.
-      if (segments.isEmpty || justTime || !allowTimeZone && zoneIdOpt.isDefined) {
-        return None
+      // `null` here means the string was unparseable (see `parseTimestampToLocalDateTime`).
+      val localDateTime = parseTimestampToLocalDateTime(s, allowTimeZone)
+      if (localDateTime == null) None else Some(localDateTimeToMicros(localDateTime))
+    } catch {
+      case NonFatal(_) => None
+    }
+  }
+
+  /**
+   * Trims and parses a given UTF8 string into a [[TimestampNanosVal]] (epoch microseconds plus a
+   * sub-microsecond remainder in [0, 999]) for `TIMESTAMP_LTZ(precision)` with `precision` in [7,
+   * 9]. Fractional digits beyond `precision` are truncated. The return type is [[Option]] in
+   * order to distinguish between a valid zero value and null. Please refer to
+   * `parseTimestampString` for the allowed formats.
+   */
+  def stringToTimestampLTZNanos(
+      s: UTF8String,
+      precision: Int,
+      timeZoneId: ZoneId): Option[TimestampNanosVal] = {
+    if (precision < 7 || precision > 9) {
+      throw SparkException.internalError(
+        s"stringToTimestampLTZNanos: precision $precision is out of range [7, 9]")
+    }
+    try {
+      // `null` here means the string was unparseable (see `parseTimestampToInstant`). The shared
+      // helper carries the full fraction in the `Instant`; `instantToTimestampNanos` then splits
+      // it into (epochMicros, nanosWithinMicro) and applies the `precision` truncation.
+      val instant = parseTimestampToInstant(s, timeZoneId)
+      if (instant == null) None else Some(instantToTimestampNanos(instant, precision))
+    } catch {
+      case NonFatal(_) => None
+    }
+  }
+
+  def stringToTimestampLTZNanosAnsi(
+      s: UTF8String,
+      precision: Int,
+      timeZoneId: ZoneId,
+      context: QueryContext = null): TimestampNanosVal = {
+    stringToTimestampLTZNanos(s, precision, timeZoneId).getOrElse {
+      throw ExecutionErrors.invalidInputInCastToDatetimeError(
+        s,
+        TimestampLTZNanosType(precision),
+        context)
+    }
+  }
+
+  /**
+   * Trims and parses a given UTF8 string into a [[TimestampNanosVal]] (epoch microseconds plus a
+   * sub-microsecond remainder in [0, 999]) for `TIMESTAMP_NTZ(precision)` with `precision` in [7,
+   * 9]. Fractional digits beyond `precision` are truncated. The result is independent of time
+   * zones; a time zone component is discarded when `allowTimeZone` is `true` and rejected
+   * (returns `None`) otherwise. The return type is [[Option]] in order to distinguish between a
+   * valid zero value and null. Please refer to `parseTimestampString` for the allowed formats.
+   */
+  def stringToTimestampNTZNanos(
+      s: UTF8String,
+      precision: Int,
+      allowTimeZone: Boolean = true): Option[TimestampNanosVal] = {
+    if (precision < 7 || precision > 9) {
+      throw SparkException.internalError(
+        s"stringToTimestampNTZNanos: precision $precision is out of range [7, 9]")
+    }
+    try {
+      // `null` here means the string was unparseable (see `parseTimestampToLocalDateTime`). The
+      // shared helper carries the full fraction in the `LocalDateTime`;
+      // `localDateTimeToTimestampNanos` then splits it into (epochMicros, nanosWithinMicro) and
+      // applies the `precision` truncation.
+      val localDateTime = parseTimestampToLocalDateTime(s, allowTimeZone)
+      if (localDateTime == null) {
+        None
+      } else {
+        Some(localDateTimeToTimestampNanos(localDateTime, precision))
       }
-      val nanoseconds = MICROSECONDS.toNanos(segments(6))
-      val localTime = LocalTime.of(segments(3), segments(4), segments(5), nanoseconds.toInt)
-      val localDate = LocalDate.of(segments(0), segments(1), segments(2))
-      val localDateTime = LocalDateTime.of(localDate, localTime)
-      Some(localDateTimeToMicros(localDateTime))
     } catch {
       case NonFatal(_) => None
     }
   }
 
+  /**
+   * ANSI variant of [[stringToTimestampNTZNanos]]. Throws
+   * [[org.apache.spark.SparkDateTimeException]] on invalid input. Uses `allowTimeZone = true`: a
+   * time zone component in the string is silently discarded rather than rejected. Callers that
+   * need strict NTZ rejection should call [[stringToTimestampNTZNanos]] directly with
+   * `allowTimeZone = false`.
+   */
+  def stringToTimestampNTZNanosAnsi(
+      s: UTF8String,
+      precision: Int,
+      context: QueryContext = null): TimestampNanosVal = {
+    // TODO(SPARK-57032): when this is wired to a user-facing CAST(... AS TIMESTAMP_NTZ(p)), the
+    // cast must decide `allowTimeZone` explicitly (per ANSI/legacy mode) instead of relying on
+    // the `true` default used here, which silently discards a zone suffix.
+    stringToTimestampNTZNanos(s, precision).getOrElse {
+      throw ExecutionErrors.invalidInputInCastToDatetimeError(
+        s,
+        TimestampNTZNanosType(precision),
+        context)
+    }
+  }
+
   /**
    * Trims and parses a given UTF8 string to a corresponding [[Long]] value which representing the
    * number of microseconds since the midnight. The result will be independent of time zones.
diff --git a/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/util/TimestampNanosParseSuite.scala b/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/util/TimestampNanosParseSuite.scala
new file mode 100644
index 000000000000..3a4d758da892
--- /dev/null
+++ b/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/util/TimestampNanosParseSuite.scala
@@ -0,0 +1,248 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.util
+
+import java.time.{ZoneId, ZoneOffset}
+
+import org.apache.spark.{SparkDateTimeException, SparkException, SparkFunSuite}
+import org.apache.spark.sql.catalyst.util.DateTimeTestUtils._
+import org.apache.spark.sql.catalyst.util.DateTimeUtils._
+import org.apache.spark.unsafe.types.{TimestampNanosVal, UTF8String}
+
+/**
+ * Tests for string-to-nanosecond timestamp parsing added under SPARK-57032. The parser keeps the
+ * microsecond part (fractional digits 1-6) and the sub-microsecond remainder (digits 7-9, in
+ * [0, 999]) and applies the target fractional precision `p` in [7, 9] by truncating extra digits.
+ */
+class TimestampNanosParseSuite extends SparkFunSuite {
+
+  private val losAngeles = getZoneId("America/Los_Angeles")
+
+  private def ntz(
+      str: String,
+      precision: Int,
+      allowTimeZone: Boolean = true): Option[TimestampNanosVal] = {
+    stringToTimestampNTZNanos(UTF8String.fromString(str), precision, allowTimeZone)
+  }
+
+  private def ltz(str: String, precision: Int, zoneId: ZoneId): Option[TimestampNanosVal] = {
+    stringToTimestampLTZNanos(UTF8String.fromString(str), precision, zoneId)
+  }
+
+  test("NTZ: fractional digits 7-9 are preserved as nanosWithinMicro") {
+    assert(ntz("2015-01-02 00:00:00.123456789", 9).get ===
+      TimestampNanosVal.fromParts(date(2015, 1, 2, 0, 0, 0, 123456, ZoneOffset.UTC), 789.toShort))
+    assert(ntz("2015-01-02 00:00:00.1234567", 9).get ===
+      TimestampNanosVal.fromParts(date(2015, 1, 2, 0, 0, 0, 123456, ZoneOffset.UTC), 700.toShort))
+    assert(ntz("2015-01-02 00:00:00.12345678", 9).get ===
+      TimestampNanosVal.fromParts(date(2015, 1, 2, 0, 0, 0, 123456, ZoneOffset.UTC), 780.toShort))
+  }
+
+  test("NTZ: precision truncates excess sub-microsecond digits toward zero") {
+    val micros = date(2020, 12, 31, 23, 59, 59, 123456, ZoneOffset.UTC)
+    assert(ntz("2020-12-31 23:59:59.123456789", 9).get ===
+      TimestampNanosVal.fromParts(micros, 789.toShort))
+    assert(ntz("2020-12-31 23:59:59.123456789", 8).get ===
+      TimestampNanosVal.fromParts(micros, 780.toShort))
+    assert(ntz("2020-12-31 23:59:59.123456789", 7).get ===
+      TimestampNanosVal.fromParts(micros, 700.toShort))
+  }
+
+  test("NTZ: digits beyond the 9th are dropped") {
+    val expected = TimestampNanosVal.fromParts(
+      date(2020, 12, 31, 23, 59, 59, 123456, ZoneOffset.UTC), 789.toShort)
+    assert(ntz("2020-12-31 23:59:59.1234567890", 9).get === expected)
+    assert(ntz("2020-12-31 23:59:59.123456789999", 9).get === expected)
+  }
+
+  test("NTZ: fewer than 6 fractional digits yield zero nanosWithinMicro") {
+    assert(ntz("2020-01-01 00:00:00.0", 9).get ===
+      TimestampNanosVal.fromParts(date(2020, 1, 1, 0, 0, 0, 0, ZoneOffset.UTC), 0.toShort))
+    assert(ntz("2020-01-01 00:00:00.1", 9).get ===
+      TimestampNanosVal.fromParts(date(2020, 1, 1, 0, 0, 0, 100000, ZoneOffset.UTC), 0.toShort))
+    assert(ntz("2020-01-01 00:00:00.123456", 9).get ===
+      TimestampNanosVal.fromParts(date(2020, 1, 1, 0, 0, 0, 123456, ZoneOffset.UTC), 0.toShort))
+  }
+
+  test("NTZ: trailing zeros in the sub-microsecond part") {
+    assert(ntz("2015-01-02 00:00:00.000050000", 9).get ===
+      TimestampNanosVal.fromParts(date(2015, 1, 2, 0, 0, 0, 50, ZoneOffset.UTC), 0.toShort))
+    assert(ntz("2015-01-02 00:00:00.100000009", 9).get ===
+      TimestampNanosVal.fromParts(date(2015, 1, 2, 0, 0, 0, 100000, ZoneOffset.UTC), 9.toShort))
+  }
+
+  test("NTZ: maximum and minimum sub-microsecond fractions") {
+    assert(ntz("2020-06-15 12:00:00.999999999", 9).get ===
+      TimestampNanosVal.fromParts(date(2020, 6, 15, 12, 0, 0, 999999, ZoneOffset.UTC), 999.toShort))
+    assert(ntz("2020-06-15 12:00:00.000000001", 9).get ===
+      TimestampNanosVal.fromParts(date(2020, 6, 15, 12, 0, 0, 0, ZoneOffset.UTC), 1.toShort))
+    // ".000000001" loses its only sub-micro digit at precision 8 and 7.
+    assert(ntz("2020-06-15 12:00:00.000000001", 8).get.nanosWithinMicro === 0.toShort)
+    assert(ntz("2020-06-15 12:00:00.000000001", 7).get.nanosWithinMicro === 0.toShort)
+  }
+
+  test("NTZ: time zone component is discarded or rejected based on allowTimeZone") {
+    // With allowTimeZone = true (default) the zone suffix is discarded.
+    assert(ntz("2015-03-18T12:03:17.123456789Z", 9).get ===
+      TimestampNanosVal.fromParts(
+        date(2015, 3, 18, 12, 3, 17, 123456, ZoneOffset.UTC), 789.toShort))
+    // With allowTimeZone = false a zone suffix makes the input invalid.
+    assert(ntz("2015-03-18T12:03:17.123456789Z", 9, allowTimeZone = false).isEmpty)
+    // A time-only input cannot be parsed as TIMESTAMP_NTZ.
+    assert(ntz("12:03:17.123456789", 9).isEmpty)
+  }
+
+  test("LTZ: explicit zone offset in the string") {
+    val expected = TimestampNanosVal.fromParts(
+      date(2015, 3, 18, 12, 3, 17, 123456, getZoneId("+07:00")), 789.toShort)
+    assert(ltz("2015-03-18T12:03:17.123456789+07:00", 9, ZoneOffset.UTC).get === expected)
+  }
+
+  test("LTZ: region-based zone in the string") {
+    val expected = TimestampNanosVal.fromParts(
+      date(2015, 3, 18, 12, 3, 17, 123456, getZoneId("Europe/Moscow")), 789.toShort)
+    assert(ltz("2015-03-18T12:03:17.123456789 Europe/Moscow", 9, ZoneOffset.UTC).get === expected)
+  }
+
+  test("LTZ: falls back to the session zone when the string has no zone") {
+    val expected = TimestampNanosVal.fromParts(
+      date(2015, 3, 18, 12, 3, 17, 123456, losAngeles), 789.toShort)
+    assert(ltz("2015-03-18 12:03:17.123456789", 9, losAngeles).get === expected)
+  }
+
+  test("LTZ: precision truncation matches the NTZ path") {
+    val micros = date(2015, 3, 18, 12, 3, 17, 123456, ZoneOffset.UTC)
+    assert(ltz("2015-03-18T12:03:17.123456789Z", 7, ZoneOffset.UTC).get ===
+      TimestampNanosVal.fromParts(micros, 700.toShort))
+    assert(ltz("2015-03-18T12:03:17.123456789Z", 8, ZoneOffset.UTC).get ===
+      TimestampNanosVal.fromParts(micros, 780.toShort))
+  }
+
+  test("range edge cases with sub-microsecond fractions") {
+    // Unix epoch.
+    assert(ntz("1970-01-01 00:00:00.000000001", 9).get ===
+      TimestampNanosVal.fromParts(0L, 1.toShort))
+    // Julian/Gregorian cutover.
+    assert(ntz("1582-10-15 00:00:00.123456789", 9).get ===
+      TimestampNanosVal.fromParts(date(1582, 10, 15, 0, 0, 0, 123456, ZoneOffset.UTC), 789.toShort))
+    // End of the supported range.
+    assert(ntz("9999-12-31 23:59:59.999999999", 9).get ===
+      TimestampNanosVal.fromParts(
+        date(9999, 12, 31, 23, 59, 59, 999999, ZoneOffset.UTC), 999.toShort))
+  }
+
+  test("null input returns None") {
+    assert(stringToTimestampNTZNanos(null, 9).isEmpty)
+    assert(stringToTimestampLTZNanos(null, 9, ZoneOffset.UTC).isEmpty)
+  }
+
+  test("invalid inputs return None") {
+    assert(ntz("not a timestamp", 9).isEmpty)
+    assert(ntz("", 9).isEmpty)
+    assert(ltz("2015-13-40 99:99:99.123456789", 9, ZoneOffset.UTC).isEmpty)
+  }
+
+  test("LTZ: time-only input uses the session zone's current date") {
+    // Time-only strings are accepted by the LTZ path (date is filled with LocalDate.now);
+    // they are rejected by the NTZ path because the date is indeterminate.
+    val result = ltz("12:03:17.123456789", 9, ZoneOffset.UTC)
+    assert(result.isDefined)
+    assert(result.get.nanosWithinMicro === 789.toShort)
+    assert(ntz("12:03:17.123456789", 9).isEmpty)
+  }
+
+  test("pre-epoch (negative) timestamps with sub-microsecond fractions") {
+    // Exercises the yearSign path together with segments(9).
+    assert(ntz("-0001-01-01 00:00:00.000000001", 9).get ===
+      TimestampNanosVal.fromParts(
+        date(-1, 1, 1, 0, 0, 0, 0, ZoneOffset.UTC), 1.toShort))
+    assert(ntz("1582-10-14 23:59:59.999999999", 9).get ===
+      TimestampNanosVal.fromParts(
+        date(1582, 10, 14, 23, 59, 59, 999999, ZoneOffset.UTC), 999.toShort))
+  }
+
+  test("micro path through parseTimestampString is unchanged by the nanos extension") {
+    // Regression guard for the highest-blast-radius change: growing the segments array and
+    // pinning the parse-loop bound must not alter the microsecond results returned by the
+    // existing stringToTimestamp / stringToTimestampWithoutTimeZone APIs. On the micro path the
+    // sub-microsecond digits 7-9 are dropped, exactly as before this change.
+    def micros(str: String): Option[Long] =
+      stringToTimestamp(UTF8String.fromString(str), ZoneOffset.UTC)
+    def microsNtz(str: String): Option[Long] =
+      stringToTimestampWithoutTimeZone(UTF8String.fromString(str), allowTimeZone = true)
+
+    // 9 fractional digits: still truncated to 6 (micros); digits 7-9 ignored on the micro path.
+    assert(micros("2015-01-02 00:00:00.123456789") ===
+      Some(date(2015, 1, 2, 0, 0, 0, 123456, ZoneOffset.UTC)))
+    assert(microsNtz("2015-01-02 00:00:00.123456789") ===
+      Some(date(2015, 1, 2, 0, 0, 0, 123456, ZoneOffset.UTC)))
+    // Fewer than 6 fractional digits still right-pad to micros.
+    assert(microsNtz("2015-01-02 00:00:00.1") ===
+      Some(date(2015, 1, 2, 0, 0, 0, 100000, ZoneOffset.UTC)))
+    // Exactly 6 fractional digits are unchanged.
+    assert(microsNtz("2015-01-02 00:00:00.000456") ===
+      Some(date(2015, 1, 2, 0, 0, 0, 456, ZoneOffset.UTC)))
+    // 10+ fractional digits are still accepted and truncated to micros.
+    assert(microsNtz("2015-01-02 00:00:00.1234567890") ===
+      Some(date(2015, 1, 2, 0, 0, 0, 123456, ZoneOffset.UTC)))
+  }
+
+  test("stringToTimestampNTZNanos throws internalError for out-of-range precision") {
+    // Precision must be in [7, 9]; anything outside is a caller bug and should surface loudly.
+    Seq(0, 6, 10, -1).foreach { p =>
+      checkError(
+        exception = intercept[SparkException] {
+          stringToTimestampNTZNanos(
+            UTF8String.fromString("2020-01-01 00:00:00.123456789"), p)
+        },
+        condition = "INTERNAL_ERROR",
+        parameters = Map(
+          "message" -> s"stringToTimestampNTZNanos: precision $p is out of range [7, 9]"))
+    }
+  }
+
+  test("ANSI NTZ: time zone component in the string is silently discarded") {
+    // allowTimeZone defaults to true in the ANSI variant: the zone suffix is dropped, not
+    // rejected. Callers that need strict rejection must use stringToTimestampNTZNanos directly
+    // with allowTimeZone = false.
+    val result = stringToTimestampNTZNanosAnsi(
+      UTF8String.fromString("2015-03-18T12:03:17.123456789Z"), 9)
+    assert(result ===
+      TimestampNanosVal.fromParts(
+        date(2015, 3, 18, 12, 3, 17, 123456, ZoneOffset.UTC), 789.toShort))
+  }
+
+  test("ANSI variants throw on invalid input") {
+    val ntzValid = stringToTimestampNTZNanosAnsi(
+      UTF8String.fromString("2015-01-02 00:00:00.123456789"), 9)
+    assert(ntzValid ===
+      TimestampNanosVal.fromParts(date(2015, 1, 2, 0, 0, 0, 123456, ZoneOffset.UTC), 789.toShort))
+
+    val ltzValid = stringToTimestampLTZNanosAnsi(
+      UTF8String.fromString("2015-01-02 00:00:00.123456789Z"), 9, ZoneOffset.UTC)
+    assert(ltzValid ===
+      TimestampNanosVal.fromParts(date(2015, 1, 2, 0, 0, 0, 123456, ZoneOffset.UTC), 789.toShort))
+
+    intercept[SparkDateTimeException] {
+      stringToTimestampNTZNanosAnsi(UTF8String.fromString("invalid"), 9)
+    }
+    intercept[SparkDateTimeException] {
+      stringToTimestampLTZNanosAnsi(UTF8String.fromString("invalid"), 9, ZoneOffset.UTC)
+    }
+  }
+}
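
The precision-truncation tests above (remainder 789 becoming 700 at precision 7 and 780 at precision 8) can be reproduced with a small standalone sketch. This is not Spark's parser: `NanosFractionSketch` and `splitFraction` are hypothetical names, and the code assumes only what the tests show, namely that fractional digits 1-6 become microseconds, digits 7-9 become a remainder in [0, 999], digits past the 9th are dropped, and the remainder is truncated (not rounded) to the requested precision.

```scala
object NanosFractionSketch {
  // Hypothetical helper, not part of Spark. Splits the fractional-second digits
  // of a timestamp string into (microseconds, sub-microsecond remainder).
  def splitFraction(fraction: String, precision: Int): (Int, Short) = {
    require(precision >= 7 && precision <= 9, s"precision $precision is out of range [7, 9]")
    require(fraction.forall(c => c >= '0' && c <= '9'), "fraction must be ASCII decimal digits")
    // Keep at most 9 digits and right-pad shorter fractions with zeros.
    val digits = fraction.take(9).padTo(9, '0')
    val micros = digits.substring(0, 6).toInt    // digits 1-6
    val remainder = digits.substring(6, 9).toInt // digits 7-9, in [0, 999]
    // Truncate toward zero: precision 7 keeps hundreds, 8 keeps tens, 9 keeps all digits.
    val step = precision match {
      case 7 => 100
      case 8 => 10
      case _ => 1
    }
    (micros, (remainder / step * step).toShort)
  }
}
```

For example, `NanosFractionSketch.splitFraction("123456789", 7)` yields `(123456, 700)`, matching the expected `nanosWithinMicro` in the precision-7 assertion.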

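The pre-epoch tests pair a negative `epochMicros` with a non-negative remainder. Below is a minimal sketch of how such a pair could map to a `java.time.Instant`, assuming the remainder always counts forward from `epochMicros`; the expected values in the tests suggest this, but the email does not state it outright. `CompositeTimestampSketch` and `toInstant` are hypothetical names, not Spark APIs.

```scala
import java.time.Instant

object CompositeTimestampSketch {
  // Hypothetical conversion, not part of Spark. Assumes nanosWithinMicro is in
  // [0, 999] and is added on top of epochMicros, including for negative epochMicros.
  def toInstant(epochMicros: Long, nanosWithinMicro: Short): Instant = {
    require(nanosWithinMicro >= 0 && nanosWithinMicro <= 999, "remainder must be in [0, 999]")
    // Floor division keeps the sub-second part non-negative for pre-epoch values.
    val seconds = Math.floorDiv(epochMicros, 1000000L)
    val microsOfSecond = Math.floorMod(epochMicros, 1000000L)
    Instant.ofEpochSecond(seconds, microsOfSecond * 1000L + nanosWithinMicro)
  }
}
```

Under that assumption, `toInstant(-1L, 999.toShort)` is one nanosecond before the Unix epoch, which is why floor division is used here rather than truncating division.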
