MaxGekk commented on code in PR #56355:
URL: https://github.com/apache/spark/pull/56355#discussion_r3386465478
##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/ToStringBase.scala:
##########
@@ -66,16 +66,14 @@ trait ToStringBase { self: UnaryExpression with
TimeZoneAwareExpression =>
case NoConstraint => castToString(from)
}
- private def castToString(from: DataType): Any => UTF8String = from match {
- // Nanosecond timestamp string formatting is zone-aware (LTZ renders in
the session time zone),
- // so it lives in castToStringDefault alongside the microsecond timestamp
types rather than the
- // zone-less Types Framework formatter (SPARK-57256).
- case _: TimestampNTZNanosType | _: TimestampLTZNanosType =>
castToStringDefault(from)
- case _ =>
- TypeApiOps(from)
- .map(ops => acceptAny[Any](v => ops.formatUTF8(v)))
- .getOrElse(castToStringDefault(from))
- }
+ // The Types Framework is the single integration point for framework types'
cast-to-string, via
+ // the zone-less formatUTF8. The cast's session zone is threaded into the
lookup so TIMESTAMP_LTZ
+ // nanos renders in it; zone-independent types (TimeType, TIMESTAMP_NTZ
nanos) ignore it
+ // (SPARK-57285).
+ private def castToString(from: DataType): Any => UTF8String =
Review Comment:
Good catch, and yes — this is a conscious trade-off. A few points on why the
cost is bounded:
- Codegen (the default) is unaffected: the ops instance is resolved once at
code-generation time (the `TypeApiOps(from, zoneId).get` reference object), not
per row.
- The interpreted per-row cost is a thin wrapper allocation, not a formatter
rebuild. `TimestampFormatter.getFractionFormatter` reuses the shared static
`DateTimeFormatterHelper.fractionFormatter` `DateTimeFormatter`; the LTZ ops
only adds a lazy `withZone(zoneId)` wrapper, and the heavier `legacyFormatter`
is lazy and only built on the diff-check error fallback, not on the normal
nanos render path. So we're not re-parsing a pattern per row — just allocating
a small `FractionTimestampFormatter` (and the ops object) per top-level row,
reused across the elements of that row.
- This mirrors the existing `TimeType` per-row-ops shape on master, so the
framework stays internally consistent.
Net: a small per-row object allocation for interpreted execution of nested
nanos collections, which I think is acceptable. Longer term, the clean fix is
framework-wide — cache the resolved ops per `(DataType, zoneId)` at the
`TypeApiOps` lookup layer so the interpreted nested closures reuse one instance
(this would also subsume the equivalent `TimeType` allocation) rather than
special-casing nanos here. I can file a follow-up JIRA to track that if you'd
like.
##########
sql/api/src/main/scala/org/apache/spark/sql/types/ops/TypeApiOps.scala:
##########
@@ -152,15 +155,24 @@ object TypeApiOps {
*
* @param dt
* the DataType to get operations for
+ * @param zoneId
+ * the session time zone for zone-aware rendering (TIMESTAMP_LTZ nanos).
CAST passes the
+ * cast's resolved zone; zone-less callers (EXPLAIN / SQL-literal / Row
JSON) accept the
Review Comment:
Thanks — both parts are fair, addressed as follows:
1. `Row.jsonValue`: you're right it's the external-value path and
`ops.format` would hit a `ClassCastException` (`Instant`/`LocalDateTime` →
`TimestampNanosVal`). This was actually a pre-existing latent bug for every
framework type (`TimeType` hits the same `LocalTime` → `Long` CCE on master),
so I split the structural fix into #56392 / SPARK-57338, which routes
`Row.jsonValue` through `formatExternal` and is now merged. I've rebased this
PR on top, and the nanos ops now implement `formatExternal` to render the
external `Instant`/`LocalDateTime` at the column precision (NTZ
zone-independent, LTZ in the session zone) — matching the CAST output. So
`Row(...).json` on a nanos LTZ/NTZ column now renders the value instead of
throwing, and I added a `RowJsonSuite` test covering both NTZ and LTZ at
precision 9.
2. user-facing-change scope: agreed — `EXPLAIN` and SQL-literal `toSQLValue`
don't route through the framework (`Literal.toString`/`Literal.sql` render via
`value.toString`, and `toSQLValue` has no production caller), so they neither
raised before nor change now. I've narrowed the description to the two real
deltas (CAST path + `Row.json` on nanos) and dropped the `EXPLAIN`/`toSQLValue`
claims.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]