[
https://issues.apache.org/jira/browse/SPARK-56745?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
ASF GitHub Bot updated SPARK-56745:
-----------------------------------
Labels: pull-request-available (was: )
> Cache foldable ZoneId in ConvertTimezone to avoid per-row lookup
> ----------------------------------------------------------------
>
> Key: SPARK-56745
> URL: https://issues.apache.org/jira/browse/SPARK-56745
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 4.1.1
> Reporter: Tongwei
> Priority: Major
> Labels: pull-request-available
>
> The `ConvertTimezone` expression resolves both source and target timezone
> arguments via `DateTimeUtils.getZoneId` on every row, even when the timezone
> arguments are constant literals -- which is the typical usage:
> convert_timezone('UTC', 'America/Los_Angeles', ts_col)
> Each `getZoneId` call performs a regex normalization
> (`Pattern.matcher().replaceFirst()`) followed by a
> `ZoneId.of(..., ZoneId.SHORT_IDS)` lookup, which goes through
> `ZoneRulesProvider`'s internal map. Doing this twice per row is wasteful
> when the result is the same for the entire query.
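
To make the per-row cost concrete, here is a minimal Java sketch of the pattern described above, under stated assumptions: `ZoneLookupSketch`, `resolve`, and the normalization regex are hypothetical stand-ins for illustration, not Spark's actual `DateTimeUtils.getZoneId`. Only `Pattern`/`Matcher.replaceFirst` and `ZoneId.of(String, Map)` with `ZoneId.SHORT_IDS` are taken from the description.

```java
import java.time.Instant;
import java.time.ZoneId;
import java.util.regex.Pattern;

class ZoneLookupSketch {
    // Hypothetical normalization: pad a single-digit hour ("+8:00" -> "+08:00").
    // The real pattern used by Spark is not shown in the issue text.
    private static final Pattern SINGLE_DIGIT_HOUR =
        Pattern.compile("(?<sign>[+\\-])(?<hour>\\d):");

    // Counts how often a zone string is resolved, to make the cost visible.
    static int resolveCalls = 0;

    // Regex normalization followed by a ZoneId lookup, as described in the issue.
    static ZoneId resolve(String tz) {
        resolveCalls++;
        String normalized =
            SINGLE_DIGIT_HOUR.matcher(tz).replaceFirst("${sign}0${hour}:");
        return ZoneId.of(normalized, ZoneId.SHORT_IDS);
    }

    // Per-row resolution: the same constant string is resolved for every row.
    static long sumOffsetsPerRow(String tz, long[] epochSeconds) {
        long total = 0;
        for (long s : epochSeconds) {
            ZoneId zone = resolve(tz);
            total += zone.getRules().getOffset(Instant.ofEpochSecond(s)).getTotalSeconds();
        }
        return total;
    }

    // Hoisted resolution: the constant string is resolved once, then reused.
    static long sumOffsetsCached(String tz, long[] epochSeconds) {
        ZoneId zone = resolve(tz);
        long total = 0;
        for (long s : epochSeconds) {
            total += zone.getRules().getOffset(Instant.ofEpochSecond(s)).getTotalSeconds();
        }
        return total;
    }

    public static void main(String[] args) {
        long[] rows = {0L, 86_400L, 172_800L};
        sumOffsetsPerRow("America/Los_Angeles", rows);
        System.out.println("per-row resolve calls: " + resolveCalls);
        resolveCalls = 0;
        sumOffsetsCached("America/Los_Angeles", rows);
        System.out.println("cached resolve calls: " + resolveCalls);
    }
}
```

Both methods compute the same result; the only difference is how many times the constant zone string goes through normalization and lookup.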
> The codegen paths of the sibling expressions `FromUTCTimestamp` and
> `ToUTCTimestamp` already cache the foldable `ZoneId` via `addMutableState`
> (see `datetimeExpressions.scala:1810-1844`). This proposal brings
> `ConvertTimezone` in line:
> * Add a `@transient lazy val` for foldable source/target zones
>   (interpreted path).
> * Generate `addMutableState`-cached `ZoneId` terms when timezone args are
>   foldable (codegen path); fall back to per-row resolution otherwise.
> * Add a `convertTimestampNtzToAnotherTz(ZoneId, ZoneId, Long)` overload in
>   `DateTimeUtils` so callers can pass pre-resolved zones.
> * Short-circuit to NULL when a foldable timezone literal is null.
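
The proposed steps can be sketched in plain Java, purely as an illustration and not as the code in the pull request. `ConvertTzSketch`, `convertNtz`, and `FoldableConverter` are hypothetical names; the sketch assumes timestamps are microseconds of a wall-clock value encoded as if in UTC, and that a null timezone literal should produce a null result.

```java
import java.time.LocalDateTime;
import java.time.ZoneId;
import java.time.ZoneOffset;

class ConvertTzSketch {
    private static final long MICROS_PER_SECOND = 1_000_000L;

    // Hypothetical analogue of the proposed (ZoneId, ZoneId, Long) overload:
    // callers pass zones that are already resolved.
    static long convertNtz(ZoneId source, ZoneId target, long micros) {
        LocalDateTime local = LocalDateTime.ofEpochSecond(
            Math.floorDiv(micros, MICROS_PER_SECOND),
            (int) Math.floorMod(micros, MICROS_PER_SECOND) * 1000,
            ZoneOffset.UTC);
        // Interpret the wall-clock value in the source zone, then read the
        // wall-clock value of the same instant in the target zone.
        LocalDateTime converted =
            local.atZone(source).withZoneSameInstant(target).toLocalDateTime();
        return converted.toEpochSecond(ZoneOffset.UTC) * MICROS_PER_SECOND
            + converted.getNano() / 1000;
    }

    // String-based variant: resolves both zones on every call (per-row cost).
    static long convertNtz(String source, String target, long micros) {
        return convertNtz(
            ZoneId.of(source, ZoneId.SHORT_IDS),
            ZoneId.of(target, ZoneId.SHORT_IDS),
            micros);
    }

    // Holds zones resolved once from constant (foldable) arguments.
    static final class FoldableConverter {
        private final ZoneId source; // null when the literal was null
        private final ZoneId target; // null when the literal was null

        FoldableConverter(String sourceTz, String targetTz) {
            this.source = sourceTz == null ? null : ZoneId.of(sourceTz, ZoneId.SHORT_IDS);
            this.target = targetTz == null ? null : ZoneId.of(targetTz, ZoneId.SHORT_IDS);
        }

        // Short-circuits to null if either constant zone or the input is null.
        Long apply(Long micros) {
            if (source == null || target == null || micros == null) {
                return null;
            }
            return convertNtz(source, target, micros);
        }
    }

    public static void main(String[] args) {
        FoldableConverter conv = new FoldableConverter("UTC", "America/Los_Angeles");
        System.out.println(conv.apply(0L));
        System.out.println(new FoldableConverter(null, "UTC").apply(0L));
    }
}
```

Constructing `FoldableConverter` once per query plays the role that the `lazy val` and the `addMutableState` terms play in the proposal; the string overload corresponds to the fallback for non-foldable arguments.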
> Expected impact: ~1.3-2x speedup of the `convert_timezone` function in the
> common foldable-arguments case; meaningful (single-digit to low double-digit
> percentage) end-to-end speedup for ETL workloads where `convert_timezone` is
> on the hot path.