uros-b commented on code in PR #56355:
URL: https://github.com/apache/spark/pull/56355#discussion_r3376305734
##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/ToStringBase.scala:
##########
@@ -66,16 +66,14 @@ trait ToStringBase { self: UnaryExpression with
TimeZoneAwareExpression =>
case NoConstraint => castToString(from)
}
- private def castToString(from: DataType): Any => UTF8String = from match {
- // Nanosecond timestamp string formatting is zone-aware (LTZ renders in
the session time zone),
- // so it lives in castToStringDefault alongside the microsecond timestamp
types rather than the
- // zone-less Types Framework formatter (SPARK-57256).
- case _: TimestampNTZNanosType | _: TimestampLTZNanosType =>
castToStringDefault(from)
- case _ =>
- TypeApiOps(from)
- .map(ops => acceptAny[Any](v => ops.formatUTF8(v)))
- .getOrElse(castToStringDefault(from))
- }
+ // The Types Framework is the single integration point for framework types'
cast-to-string, via
+ // the zone-less formatUTF8. The cast's session zone is threaded into the
lookup so TIMESTAMP_LTZ
+ // nanos renders in it; zone-independent types (TimeType, TIMESTAMP_NTZ
nanos) ignore it
+ // (SPARK-57285).
+ private def castToString(from: DataType): Any => UTF8String =
Review Comment:
Not blocking for this particular PR, but more of a performance discussion:
For a nanos timestamp nested in an array/map/struct, castToString(et) is
invoked inside the per-row element closure, so each row constructs a fresh ops
instance whose lazy formatter builds a new TimestampFormatter. Previously the
nested path reused the ToStringBase-level shared timestampFormatter /
timestampNTZFormatter. This is a (bounded) per-row allocation regression for
interpreted execution of nested nanos timestamps.
The same per-row-ops pattern already exists for TimeType on master, so it is
consistent with the framework's current shape. However, the difference is that
TimestampFormatter.getFractionFormatter is heavier than the time formatter. The
codegen path is unaffected (ops built once at codegen time). Probably
acceptable, but worth confirming it's a conscious trade-off.
Codegen (the default) is unaffected, so are we aware of the interpreted
nested-collection casts performance implications, and do we have any long-term
plans to address this?
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]