Licht-T opened a new pull request, #55736:
URL: https://github.com/apache/spark/pull/55736

   ### What changes were proposed in this pull request?
   
   This PR extends the offset-arithmetic + DST-equality-guard fast path introduced in SPARK-56663 from MINUTE / HOUR / DAY to the date-level units WEEK / MONTH / QUARTER / YEAR.
   
   The framework for offset-based truncation -- resolve offset once, apply, 
truncate in the local frame, re-apply, DST guard, fall back on DST-cross or 
arithmetic overflow -- is identical for every level above SECOND. Only the 
"truncate in local frame" step varies. This PR inlines SPARK-56663's 
`truncToUnitFast` together with the new date-level path directly into 
`truncTimestamp`, and keeps a single private `truncTimestampSlow` as a complete 
reference implementation that the fast path falls back to:
   
   ```scala
   def truncTimestamp(micros: Long, level: Int, zoneId: ZoneId): Long = {
     // MICROSECOND / MILLISECOND / SECOND short-circuits (no zone work).
     // Offset arithmetic for every other level.
     // DST guard, fallback to truncTimestampSlow.
   }
   
   private def truncTimestampSlow(micros: Long, level: Int, zoneId: ZoneId): Long
   ```
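   
   As an illustration only, the shared framework can be sketched for the DAY level as follows. This is not the PR's code: the object and method names (`FastTruncSketch`, `truncToDayFast`, `truncToDaySlow`, `offsetMicros`) are hypothetical, and the slow path here is a plain `java.time` stand-in rather than Spark's `truncTimestampSlow`.
   
   ```scala
   import java.time.{Instant, ZoneId}
   import java.time.temporal.ChronoUnit
   
   object FastTruncSketch {
     val MicrosPerSecond: Long = 1000000L
     val MicrosPerDay: Long = 86400L * MicrosPerSecond
   
     // Zone offset in effect at the given instant, expressed in microseconds.
     private def offsetMicros(micros: Long, zoneId: ZoneId): Long = {
       val instant = Instant.ofEpochSecond(Math.floorDiv(micros, MicrosPerSecond))
       zoneId.getRules.getOffset(instant).getTotalSeconds * MicrosPerSecond
     }
   
     // Stand-in reference path: let java.time resolve the start of the local day.
     def truncToDaySlow(micros: Long, zoneId: ZoneId): Long = {
       val instant = Instant.EPOCH.plus(micros, ChronoUnit.MICROS)
       val start = instant.atZone(zoneId).toLocalDate.atStartOfDay(zoneId).toInstant
       ChronoUnit.MICROS.between(Instant.EPOCH, start)
     }
   
     def truncToDayFast(micros: Long, zoneId: ZoneId): Long = {
       try {
         // 1. Resolve the offset once and shift into the local frame.
         val offset = offsetMicros(micros, zoneId)
         val local = Math.addExact(micros, offset)
         // 2. Truncate in the local frame (the only level-specific step).
         val truncatedLocal = Math.subtractExact(local, Math.floorMod(local, MicrosPerDay))
         // 3. Shift back with the same offset to get a UTC candidate.
         val candidate = Math.subtractExact(truncatedLocal, offset)
         // 4. DST guard: accept the candidate only if its offset matches.
         if (offsetMicros(candidate, zoneId) == offset) candidate
         else truncToDaySlow(micros, zoneId)
       } catch {
         // 5. Arithmetic overflow also falls back to the reference path.
         case _: ArithmeticException => truncToDaySlow(micros, zoneId)
       }
     }
   }
   ```
   
   In this sketch, an instant on the afternoon of a spring-forward day takes the fallback branch, because the local midnight candidate lies before the transition and therefore carries a different offset.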
   
   The local-frame truncation step is the only thing the fast path branches on:
   
   - `MICROSECOND` / `MILLISECOND` / `SECOND` - pure UTC `floorMod` (zone 
offsets have at most second precision per `java.time.ZoneOffset`; no zone 
information needed).
   - `MINUTE` / `HOUR` / `DAY` - shifted-local `floorMod` against the unit 
micros.
   - `WEEK` / `MONTH` / `QUARTER` / `YEAR` - compute local epoch-day by integer 
division, run `truncDate` in the local-day frame, multiply back to local micros.
   
   Everything else (offset resolve via `rules.getOffset`, `addExact` / 
`subtractExact`, DST guard via offset-equality at the candidate, slow-path 
fallback) is shared.
   
   The DST guard fires correctly for the new date-level cases. For example, YEAR truncation of a late-March instant in `America/Los_Angeles` (after the spring-forward) produces a candidate at Jan 1, which is in PST (offset -8), while the original is in PDT (offset -7). The offsets differ, so the path falls back to the slow `microsToDays` / `daysToMicros` route which uses `ZonedDateTime.resolveLocal` to land on Jan 1 00:00 PST.
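   
   The guard condition in this example can be reproduced with plain `java.time` calls. The concrete date (2015-03-20) and the object name `DstGuardExample` are assumptions chosen for illustration; the PR text does not name a specific instant.
   
   ```scala
   import java.time.{Instant, LocalDate, LocalDateTime, ZoneId}
   
   object DstGuardExample {
     val la: ZoneId = ZoneId.of("America/Los_Angeles")
     private val rules = la.getRules
   
     // 2015-03-20 12:00 local time, after the 2015-03-08 spring-forward (PDT).
     val original: Instant = Instant.parse("2015-03-20T19:00:00Z")
     val originalOffset = rules.getOffset(original) // -07:00
   
     // Fast-path candidate: local Jan 1 00:00 shifted back with the ORIGINAL offset.
     val candidate: Instant = LocalDateTime.of(2015, 1, 1, 0, 0).toInstant(originalOffset)
     val candidateOffset = rules.getOffset(candidate) // -08:00 (PST)
   
     // Offsets differ, so the guard rejects the candidate.
     val guardFires: Boolean = candidateOffset != originalOffset // true
   
     // What a zone-aware resolution yields instead: Jan 1 00:00 PST.
     val resolved: Instant = LocalDate.of(2015, 1, 1).atStartOfDay(la).toInstant // 2015-01-01T08:00:00Z
   }
   ```
   
   The rejected candidate (`2015-01-01T07:00:00Z`) is one hour earlier than the correct answer, which is exactly the error the guard exists to catch.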
   
   This PR also rewrites `TRUNC_TO_QUARTER` from `IsoFields.DAY_OF_QUARTER` (a `TemporalField` applied through the generic `with(field, value)` adjustment, which produces a fresh `LocalDate`) to a direct `withMonth(firstMonthOfQuarter).withDayOfMonth(1)` chain on the existing `LocalDate`. This saves one allocation plus the field-adjustment overhead per call.
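   
   A small sketch comparing the two quarter-start computations, assuming the previous form was `with(IsoFields.DAY_OF_QUARTER, 1L)` and that `firstMonthOfQuarter` is derived from the month value as shown. The names `QuarterTruncSketch`, `viaIsoField`, and `viaWithMonth` are hypothetical.
   
   ```scala
   import java.time.LocalDate
   import java.time.temporal.IsoFields
   
   object QuarterTruncSketch {
     // Assumed previous form: set the day-of-quarter field to 1.
     def viaIsoField(d: LocalDate): LocalDate =
       d.`with`(IsoFields.DAY_OF_QUARTER, 1L)
   
     // Direct form: jump to the first month of the quarter, then to day 1.
     def viaWithMonth(d: LocalDate): LocalDate = {
       val firstMonthOfQuarter = (d.getMonthValue - 1) / 3 * 3 + 1 // 1, 4, 7 or 10
       d.withMonth(firstMonthOfQuarter).withDayOfMonth(1)
     }
   }
   ```
   
   `withMonth` clamps an out-of-range day to the last valid day of the target month (for example May 31 becomes April 30) before `withDayOfMonth(1)` is applied, so the chain does not throw for day 31.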
   
   `truncTimestampSlow` covers every level explicitly so it serves as a 
self-contained reference implementation - the fast path's correctness can be 
verified against it case-by-case.
   
   ### Why are the changes needed?
   
   SPARK-33404 (Nov 2020) routed every `date_trunc` level above SECOND through `microsToInstant().atZone(zoneId).truncatedTo(unit)` for correctness, at a cost of roughly 5.5× in throughput per the follow-up benchmark PR (apache/spark#30338). SPARK-56663 recovered most of that for MINUTE / HOUR / DAY using the offset-arithmetic + DST-guard pattern. This PR extends the same recovery to WEEK / MONTH / QUARTER / YEAR, the levels that drive monthly/quarterly aggregations in analytics workloads.
   
   `DateTimeBenchmark` Truncation results, wholestage on, ns/row on a 12th Gen 
Intel i7-1260P (master = pre-SPARK-56663):
   
   | level   | master baseline | this PR | speedup |
   |---------|---------------:|--------:|--------:|
   | WEEK    | 165.2 |  78.2 | 2.11× |
   | MONTH   | 181.9 |  92.2 | 1.97× |
   | MM      | 182.2 |  92.5 | 1.97× |
   | MON     | 182.9 |  92.7 | 1.97× |
   | QUARTER | 216.8 | 108.8 | 1.99× |
   | YEAR    | 205.2 |  96.7 | 2.12× |
   | YYYY    | 205.8 |  96.9 | 2.12× |
   | YY      | 206.3 |  96.0 | 2.15× |
   
   Time-level units (MIN/HR/DAY/SECOND) and `trunc(date, ...)` are unchanged 
within noise; the hot path for those levels is byte-identical to SPARK-56663 
after the unification.
   
   ### Does this PR introduce _any_ user-facing change?
   
   No. The output of `date_trunc` is identical to master in all cases, including DST-spanning truncations (ensured by the offset-equality guard and slow-path fallback, and covered by the new tests). Only the internal implementation changes.
   
   ### How was this patch tested?
   
   - `DateTimeUtilsSuite` - all 66 tests pass, including:
     - `SPARK-33404: test truncTimestamp when time zone offset from UTC has a 
granularity of seconds`, extended to also exercise WEEK / MONTH / QUARTER / 
YEAR with the 1769-10-17 LMT timestamp across **every** available zone (the 
existing loop already covered SECOND/MILLI/MICRO; SPARK-56663 added HOUR/DAY; 
this PR completes the matrix).
     - The existing `truncTimestamp` test, which loops WEEK / MONTH / QUARTER / 
YEAR for 2015 timestamps across every zone.
     - New test `truncTimestamp date-level units across DST boundaries` - 
covers YEAR / QUARTER truncation that crosses the LA spring-forward (DST guard 
fires, fallback path runs) and MONTH truncation entirely within DST (fast path 
stays).
   - `DateExpressionsSuite` - all tests pass (no changes to expression-level 
code, only the underlying `DateTimeUtils` helpers).
   - `DateTimeBenchmark` re-run via the GitHub Actions `Run benchmarks` 
workflow on this fork for JDK 17, 21, and 25; results committed back to the 
branch.
   
   ### Was this patch authored or co-authored using generative AI tooling?
   
   Yes, co-authored with Claude Code.
   

