Re: [PR] [SPARK-57285][SQL] Route nanosecond timestamp cast-to-string through the Types Framework [spark]

via GitHub Wed, 10 Jun 2026 00:52:49 -0700


MaxGekk commented on code in PR #56355:
URL: https://github.com/apache/spark/pull/56355#discussion_r3386465478



##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/ToStringBase.scala:
##########
@@ -66,16 +66,14 @@ trait ToStringBase { self: UnaryExpression with TimeZoneAwareExpression =>
       case NoConstraint => castToString(from)
     }
 
-  private def castToString(from: DataType): Any => UTF8String = from match {
-    // Nanosecond timestamp string formatting is zone-aware (LTZ renders in the session time zone),
-    // so it lives in castToStringDefault alongside the microsecond timestamp types rather than the
-    // zone-less Types Framework formatter (SPARK-57256).
-    case _: TimestampNTZNanosType | _: TimestampLTZNanosType => castToStringDefault(from)
-    case _ =>
-      TypeApiOps(from)
-        .map(ops => acceptAny[Any](v => ops.formatUTF8(v)))
-        .getOrElse(castToStringDefault(from))
-  }
+  // The Types Framework is the single integration point for framework types' cast-to-string, via
+  // the zone-less formatUTF8. The cast's session zone is threaded into the lookup so TIMESTAMP_LTZ
+  // nanos renders in it; zone-independent types (TimeType, TIMESTAMP_NTZ nanos) ignore it
+  // (SPARK-57285).
+  private def castToString(from: DataType): Any => UTF8String =

Review Comment:
   Good catch, and yes, this is a conscious trade-off. A few points on why the
   cost is bounded:
   
   - Codegen (the default) is unaffected: the ops instance is resolved once at
   code-generation time (the `TypeApiOps(from, zoneId).get` reference object), not
   per row.
   - The interpreted per-row cost is a thin wrapper allocation, not a formatter
   rebuild. `TimestampFormatter.getFractionFormatter` reuses the shared static
   `DateTimeFormatterHelper.fractionFormatter` (a `DateTimeFormatter`); the LTZ ops
   only add a lazy `withZone(zoneId)` wrapper, and the heavier `legacyFormatter`
   is lazy and built only on the diff-check error fallback, not on the normal
   nanos render path. So we're not re-parsing a pattern per row, just allocating
   a small `FractionTimestampFormatter` (and the ops object) per top-level row,
   reused across the elements of that row.
   - This mirrors the existing `TimeType` per-row-ops shape on master, so the
   framework stays internally consistent.
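
   To illustrate the second bullet, here is a minimal sketch using only `java.time`. It is not the PR's code: `SharedFormatterSketch`, `ZonedWrapper`, and the fixed nine-digit pattern are assumptions made for illustration. The pattern is parsed once into a shared formatter, and the per-row object only derives a zoned copy lazily.

   ```scala
   import java.time.{Instant, ZoneId}
   import java.time.format.DateTimeFormatter

   object SharedFormatterSketch {
     // Parsed once and shared by every caller. Stands in for a static fraction
     // formatter; the pattern itself is an assumption for this sketch.
     val shared: DateTimeFormatter =
       DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss.SSSSSSSSS")

     // Hypothetical per-row wrapper: allocating it does not touch the pattern.
     final class ZonedWrapper(zoneId: ZoneId) {
       // withZone returns a copy of the shared formatter with an override zone;
       // it is only evaluated on the first format call.
       private lazy val zoned: DateTimeFormatter = shared.withZone(zoneId)

       def format(instant: Instant): String = zoned.format(instant)
     }
   }
   ```

   Allocating a `ZonedWrapper` per row costs one small object plus, on first use, one `withZone` copy; the pattern string is never re-parsed.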
   
   Net: interpreted execution of nested nanos collections pays a small per-row
   object allocation, which I think is acceptable. Longer term, the clean fix is
   framework-wide: cache the resolved ops per `(DataType, zoneId)` at the
   `TypeApiOps` lookup layer so the interpreted nested closures reuse one instance
   (this would also subsume the equivalent `TimeType` allocation), rather than
   special-casing nanos here. I can file a follow-up JIRA to track that if you'd
   like.
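
   A rough sketch of what such a lookup-layer cache could look like, assuming stand-in types: `DataTypeLike`, `Ops`, and `OpsCacheSketch` are hypothetical names, not Spark classes, and the real follow-up may well take a different shape.

   ```scala
   import java.time.ZoneId
   import java.util.concurrent.ConcurrentHashMap

   object OpsCacheSketch {
     // Hypothetical stand-ins for Spark's DataType hierarchy.
     sealed trait DataTypeLike
     case object TimestampLtzNanos extends DataTypeLike
     case object TimeOfDay extends DataTypeLike

     // Hypothetical stand-in for a resolved ops instance.
     final class Ops(val dt: DataTypeLike, val zoneId: ZoneId)

     // One resolved instance per (type, zone) pair, shared across rows and threads.
     private val cache = new ConcurrentHashMap[(DataTypeLike, ZoneId), Ops]()

     def lookup(dt: DataTypeLike, zoneId: ZoneId): Ops =
       cache.computeIfAbsent((dt, zoneId), key => new Ops(key._1, key._2))
   }
   ```

   With this shape, repeated lookups for the same `(type, zone)` pair return the same instance, so nested closures would stop allocating per row. An unbounded map is an assumption of the sketch; a real cache would need a policy for the number of distinct zones.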
   



##########
sql/api/src/main/scala/org/apache/spark/sql/types/ops/TypeApiOps.scala:
##########
@@ -152,15 +155,24 @@ object TypeApiOps {
    *
    * @param dt
    *   the DataType to get operations for
+   * @param zoneId
+   *   the session time zone for zone-aware rendering (TIMESTAMP_LTZ nanos). CAST passes the
+   *   cast's resolved zone; zone-less callers (EXPLAIN / SQL-literal / Row JSON) accept the

Review Comment:
   Thanks, both points are fair. Addressed as follows:
   
   1. `Row.jsonValue`: you're right that it's the external-value path and that
   `ops.format` would hit a `ClassCastException` (`Instant`/`LocalDateTime` →
   `TimestampNanosVal`). This was a pre-existing latent bug for every
   framework type (`TimeType` hits the same `LocalTime` → `Long` CCE on master),
   so I split the structural fix into #56392 / SPARK-57338, which routes
   `Row.jsonValue` through `formatExternal` and is now merged. I've rebased this
   PR on top, and the nanos ops now implement `formatExternal` to render the
   external `Instant`/`LocalDateTime` at the column precision (NTZ is
   zone-independent, LTZ renders in the session zone), matching the CAST output. So
   `Row(...).json` on a nanos LTZ/NTZ column now renders the value instead of
   throwing, and I added a `RowJsonSuite` test covering both NTZ and LTZ at
   precision 9.
   2. User-facing-change scope: agreed. `EXPLAIN` and SQL-literal `toSQLValue`
   don't route through the framework (`Literal.toString`/`Literal.sql` render via
   `value.toString`, and `toSQLValue` has no production caller), so they neither
   threw before nor change now. I've narrowed the description to the two real
   deltas (CAST path + `Row.json` on nanos) and dropped the `EXPLAIN`/`toSQLValue`
   claims.
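
   For readers following point 1, a simplified sketch of the external-value dispatch idea, built on `java.time` only. `FormatExternalSketch`, its method signature, and the fixed nine-digit pattern are assumptions for illustration; the actual `formatExternal` in the framework may differ in signature and output format.

   ```scala
   import java.time.{Instant, LocalDateTime, ZoneId}
   import java.time.format.DateTimeFormatter

   object FormatExternalSketch {
     // Assumed fixed nine-digit fraction pattern (precision 9).
     private val fmt = DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss.SSSSSSSSS")

     // Dispatch on the external Java type instead of casting to an internal
     // representation, which is what would raise a ClassCastException.
     def formatExternal(value: Any, sessionZone: ZoneId): String = value match {
       // LTZ: a point on the timeline, rendered in the session zone.
       case i: Instant => fmt.withZone(sessionZone).format(i)
       // NTZ: a wall-clock value, rendered as-is regardless of the zone.
       case l: LocalDateTime => fmt.format(l)
       case other =>
         throw new IllegalArgumentException(s"Unsupported external value: $other")
     }
   }
   ```

   In this sketch the same `Instant` renders differently under different session zones, while a `LocalDateTime` renders identically under all of them.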
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

