Licht-T opened a new pull request, #55610:
URL: https://github.com/apache/spark/pull/55610
### What changes were proposed in this pull request?
This PR re-introduces a fast path in `DateTimeUtils.truncTimestamp` for
`TRUNC_TO_MINUTE`, `TRUNC_TO_HOUR`, and `TRUNC_TO_DAY`. A new private
`truncToUnitFast(micros, zoneId, unitMicros, fallbackUnit)` helper:
1. Resolves the zone offset for the input instant once
(`zoneId.getRules.getOffset(...)`).
2. Truncates by `Math.floorMod` in local time: `local = micros +
offsetMicros`, then `truncatedLocal = local - floorMod(local, unitMicros)`,
then back: `candidate = truncatedLocal - offsetMicros`.
3. Re-checks the offset at `candidate` and falls back to the existing
`truncatedTo()`-based slow path if it differs from the original offset (the
SPARK-30766 / SPARK-30857 DST guard).
4. The DST re-check is skipped for `ZoneRules.isFixedOffset` zones.
5. `Math.addExact` / `subtractExact` guard the offset arithmetic; on
overflow we also fall back to the slow path.
The arithmetic also handles sub-minute LMT offsets (e.g. America/Los_Angeles
`-07:52:58` pre-1883 - the SPARK-33404 case) and 30/45-minute zones
(Asia/Kolkata `+05:30`, Asia/Kathmandu `+05:45`) without an offset-alignment
guard, because the offset is applied as part of the floorMod.
Date-level units (WEEK/MONTH/QUARTER/YEAR + `trunc(date, ...)`) are not
touched in this PR; they still go through `microsToDays` / `daysToMicros`.
Those can follow up in a separate ticket.
### Why are the changes needed?
SPARK-33404 (Nov 2020) fixed a correctness bug - `date_trunc('minute',
'1769-10-17 17:10:02')` in `America/Los_Angeles` returned the original
timestamp because the existing fast path ignored the time-zone offset and the
LMT offset has second granularity. The fix routed MINUTE/HOUR/DAY through
`microsToInstant().atZone(zoneId).truncatedTo(unit)`, which allocates
`ZonedDateTime` per row.
The author of that fix flagged the regression on the follow-up benchmark PR:
> "Slow down is around 5.5 times" - MaxGekk, apache/spark#30338
That regression has stood for ~5.5 years with no follow-up ticket. This PR
restores most of the lost throughput without re-introducing the LMT bug, by
applying the offset before/after the floorMod and verifying the offset is
unchanged at the truncated candidate (handling DST transitions that motivated
SPARK-30766 / SPARK-30857).
`DateTimeBenchmark` Truncation results, wholestage on, ns/row on a 12th Gen
Intel i7-1260P:
| level | baseline | this PR | speedup |
|---------|---------:|--------:|--------:|
| MINUTE | 136.0 | 69.5 | 1.96× |
| HOUR | 136.2 | 69.8 | 1.95× |
| DAY | 136.5 | 69.8 | 1.96× |
| DD | 136.7 | 70.1 | 1.95× |
Other levels are unchanged (within noise).
### Does this PR introduce _any_ user-facing change?
No. The output of `date_trunc` for `MINUTE`/`HOUR`/`DAY` is identical to
master in all cases; only the internal implementation changes.
### How was this patch tested?
- `DateTimeUtilsSuite` - all 61 tests pass, including:
- `SPARK-33404: test truncTimestamp when time zone offset from UTC has a
granularity of seconds`, which loops MINUTE/SECOND/MILLI/MICRO truncation
across **every** available zone (covers historical LMT zones).
- The existing `truncTimestamp` test, which loops HOUR/DAY/MINUTE/etc.
across every available zone.
- Three new tests added in this PR:
- `truncTimestamp with sub-hour zone offsets` - Asia/Kolkata `+05:30`
and Asia/Kathmandu `+05:45` HOUR/DAY truncation.
- `truncTimestamp across DST transitions` - America/Los_Angeles
spring-forward (2024-03-10) and fall-back (2024-11-03) for HOUR/DAY, exercising
the offset-equality fallback.
- `SPARK-30766/30857: truncTimestamp before the epoch in HOUR/DAY` -
pre-1970 instant in America/Los_Angeles for HOUR and DAY.
- `DateExpressionsSuite` - all 68 tests pass.
- `DateTimeBenchmark` re-run locally with the numbers shown above. Per the
contributor guide, the official `benchmarks/DateTimeBenchmark-results.txt`
should be regenerated on the standard CI runner - happy to do that via the
GitHub Actions benchmark workflow on request.
### Was this patch authored or co-authored using generative AI tooling?
Yes, co-authored by Claude Code.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]