[ 
https://issues.apache.org/jira/browse/SPARK-56769?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rito Takeuchi updated SPARK-56769:
----------------------------------
    Description: 
h2. Background

`DateTimeUtils.truncTimestamp` for the WEEK / MONTH / QUARTER / YEAR levels
currently routes through:

{code:scala}
case _ => // Try to truncate date levels
  val dDays = microsToDays(micros, zoneId)
  daysToMicros(truncDate(dDays, level), zoneId)
{code}

`microsToDays` allocates `Instant` + `ZonedDateTime` + `LocalDate` per row;
`daysToMicros` allocates `LocalDate` + `ZonedDateTime` + `Instant`. `truncDate`
itself allocates one more `LocalDate` for MONTH/YEAR (in `getDayOfMonth` /
`getDayInYear`) and *two* for QUARTER (the existing implementation goes
through `IsoFields.DAY_OF_QUARTER`, a `TemporalField` whose adjustment
produces a fresh `LocalDate`). The result is 167-218 ns/row on JDK 17
GitHub Actions runners.

SPARK-56663 introduced the offset-arithmetic + DST-equality-guard pattern
for the time-level units (MINUTE / HOUR / DAY) and confirmed that the same
pattern is sound for any unit that evenly divides {{MICROS_PER_DAY}}. The
date-level branch is a natural extension.
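
For context, the time-level pattern referenced above can be sketched roughly as follows. This is an illustration only, not the SPARK-56663 code: the object name `TruncToUnitSketch`, its helper names, and the use of `ChronoUnit` to describe the unit are assumptions made for the example. The idea is that when the unit tiles a day evenly and the offset is fixed, local unit boundaries fall at multiples of the unit counted from local midnight, so one `floorMod` on the offset-shifted value finds the boundary.

```scala
import java.time.{Instant, ZoneId}
import java.time.temporal.ChronoUnit

object TruncToUnitSketch {
  val MICROS_PER_SECOND = 1000000L

  def microsToInstant(micros: Long): Instant =
    Instant.ofEpochSecond(
      Math.floorDiv(micros, MICROS_PER_SECOND),
      Math.floorMod(micros, MICROS_PER_SECOND) * 1000L)

  def instantToMicros(i: Instant): Long =
    i.getEpochSecond * MICROS_PER_SECOND + i.getNano / 1000L

  // Zone offset (in microseconds) in effect at the given instant.
  private def offsetMicros(zone: ZoneId, micros: Long): Long =
    zone.getRules.getOffset(microsToInstant(micros)).getTotalSeconds * MICROS_PER_SECOND

  // Reference path: truncate on the local time-line via ZonedDateTime.
  def truncSlow(micros: Long, unit: ChronoUnit, zone: ZoneId): Long =
    instantToMicros(microsToInstant(micros).atZone(zone).truncatedTo(unit).toInstant)

  // Offset arithmetic plus an offset-equality guard (hypothetical shape).
  def truncFast(micros: Long, unit: ChronoUnit, zone: ZoneId): Long = {
    val unitMicros = unit.getDuration.toNanos / 1000L
    val offset = offsetMicros(zone, micros)
    // Shift to local time, drop the remainder within the unit, shift back.
    val candidate = micros - Math.floorMod(micros + offset, unitMicros)
    // If the offset changed between candidate and input, use the reference path.
    if (offsetMicros(zone, candidate) == offset) candidate
    else truncSlow(micros, unit, zone)
  }
}
```

With a half-hour zone such as `Asia/Kolkata`, the offset shift is what keeps HOUR truncation aligned to local hours rather than UTC hours.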

h2. Proposal

Add a `truncDateFast` helper paralleling `truncToUnitFast` from SPARK-56663:

# Resolve the zone offset at `micros` once.
# Compute the local epoch-day by integer division: {{Math.floorDiv(micros + offsetMicros, MICROS_PER_DAY)}}.
# Run the existing `truncDate(localDays, level)` (pure integer math for WEEK; one `LocalDate` alloc for MONTH/YEAR).
# Convert the truncated day back to UTC micros: {{truncatedDays * MICROS_PER_DAY - offsetMicros}}.
# Verify the offset at the candidate equals the offset at the original
  (the SPARK-30766 / SPARK-30857 DST guard); fall back to the slow
  `microsToDays` / `daysToMicros` path if not.
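
The five steps above can be sketched as follows. This is a minimal, self-contained illustration and not the PR's code: the `Level` cases, the object name, and the simplified `microsToDays` / `daysToMicros` / `truncDate` bodies are stand-ins written for this example (QUARTER is omitted for brevity), and only the control flow of `truncDateFast` is meant to mirror the proposal.

```scala
import java.time.{Instant, LocalDate, ZoneId}

object TruncDateFastSketch {
  val MICROS_PER_SECOND = 1000000L
  val MICROS_PER_DAY = 86400L * MICROS_PER_SECOND

  // Hypothetical stand-ins for the truncation level constants.
  sealed trait Level
  case object Week extends Level
  case object Month extends Level
  case object Year extends Level

  private def microsToInstant(micros: Long): Instant =
    Instant.ofEpochSecond(
      Math.floorDiv(micros, MICROS_PER_SECOND),
      Math.floorMod(micros, MICROS_PER_SECOND) * 1000L)

  // Simplified slow-path helpers (Instant -> ZonedDateTime -> LocalDate and back).
  def microsToDays(micros: Long, zone: ZoneId): Long =
    microsToInstant(micros).atZone(zone).toLocalDate.toEpochDay

  def daysToMicros(days: Long, zone: ZoneId): Long = {
    val i = LocalDate.ofEpochDay(days).atStartOfDay(zone).toInstant
    i.getEpochSecond * MICROS_PER_SECOND + i.getNano / 1000L
  }

  def truncDate(days: Long, level: Level): Long = level match {
    // Epoch day 0 (1970-01-01) is a Thursday, so (days + 3) mod 7 counts days since Monday.
    case Week => days - Math.floorMod(days + 3L, 7L)
    case Month => LocalDate.ofEpochDay(days).withDayOfMonth(1).toEpochDay
    case Year => LocalDate.ofEpochDay(days).withDayOfYear(1).toEpochDay
  }

  def truncSlow(micros: Long, level: Level, zone: ZoneId): Long =
    daysToMicros(truncDate(microsToDays(micros, zone), level), zone)

  def truncDateFast(micros: Long, level: Level, zone: ZoneId): Long = {
    val rules = zone.getRules
    // Step 1: offset in effect at the input instant.
    val offsetMicros = rules.getOffset(microsToInstant(micros)).getTotalSeconds * MICROS_PER_SECOND
    // Step 2: local epoch-day by integer division.
    val localDays = Math.floorDiv(micros + offsetMicros, MICROS_PER_DAY)
    // Steps 3 and 4: truncate the day, then convert back to UTC micros.
    val candidate = truncDate(localDays, level) * MICROS_PER_DAY - offsetMicros
    // Step 5: offset-equality guard, falling back to the slow path on mismatch.
    val candidateOffset = rules.getOffset(microsToInstant(candidate)).getTotalSeconds * MICROS_PER_SECOND
    if (candidateOffset == offsetMicros) candidate else truncSlow(micros, level, zone)
  }
}
```

For example, in `America/Los_Angeles` an input of 2024-03-20T12:00:00Z truncated to MONTH crosses the 2024-03-10 DST transition, so the guard fails and the slow path returns 2024-03-01T08:00:00Z. The same input truncated to WEEK stays within one offset and takes the fast path, giving 2024-03-18T07:00:00Z.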

Also rewrite `TRUNC_TO_QUARTER` from `IsoFields.DAY_OF_QUARTER` (a
`TemporalField` whose adjustment produces a fresh `LocalDate`) to a direct
`withMonth(firstMonthOfQuarter).withDayOfMonth(1)` chain on the existing
`LocalDate`. This saves one allocation plus the field-adjustment overhead.
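
As a rough illustration of the quarter rewrite, the two forms can be compared side by side. The object and method names below are made up for this example, and the `firstMonthOfQuarter` formula is an assumption about how that value would be derived; the PR may compute it differently.

```scala
import java.time.LocalDate
import java.time.temporal.IsoFields

object QuarterTruncSketch {
  // Field-based form: set the day-of-quarter to 1.
  def viaIsoField(d: LocalDate): LocalDate =
    d.`with`(IsoFields.DAY_OF_QUARTER, 1L)

  // Direct form: jump to the first month of the quarter, then to day 1.
  def direct(d: LocalDate): LocalDate = {
    // Months 1-3 -> 1, 4-6 -> 4, 7-9 -> 7, 10-12 -> 10.
    val firstMonthOfQuarter = ((d.getMonthValue - 1) / 3) * 3 + 1
    d.withMonth(firstMonthOfQuarter).withDayOfMonth(1)
  }
}
```

Calling `withMonth` before `withDayOfMonth(1)` is safe even for dates such as May 31, because `withMonth` clamps the day to the last valid day of the target month before the day is reset to 1.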

h2. Benchmark

`DateTimeBenchmark` Truncation, wholestage on, ns/row, on a 12th Gen Intel
i7-1260P:

|| level || master baseline || this PR || speedup ||
| WEEK    | 165.2 | 78.2  | 2.11x |
| MONTH   | 181.9 | 92.2  | 1.97x |
| MM      | 182.2 | 92.5  | 1.97x |
| MON     | 182.9 | 92.7  | 1.97x |
| QUARTER | 216.8 | 108.8 | 1.99x |
| YEAR    | 205.2 | 96.7  | 2.12x |
| YYYY    | 205.8 | 96.9  | 2.12x |
| YY      | 206.3 | 96.0  | 2.15x |

Stacked on top of SPARK-56663, the cumulative speedup vs master stays in the
same range, since this PR only affects truncation levels that SPARK-56663
did not touch.

h2. Out of scope

* `trunc(date, ...)` (date input, no zoneId) -- this PR only changes the
  `timestamp -> date_trunc` flow. The `TruncDate` expression bypasses
  `truncTimestamp` entirely; the only change visible to it is the
  `TRUNC_TO_QUARTER` rewrite (which `trunc(date, ...)` doesn't use in the
  benchmark today).
* MICROSECOND / MILLISECOND / SECOND / MINUTE / HOUR / DAY -- handled by
  SPARK-56663.
* Per-instance offset cache -- a separate optimization that would amortize
  the {{rules.getOffset}} call across rows. Would benefit both this PR's
  and SPARK-56663's paths. Out of scope here.
* Integer-only calendar arithmetic (Hinnant-style) -- would eliminate the
  remaining `LocalDate` allocation inside `truncDate` for MONTH/YEAR and
  push date-level units to the same floor as time-level units. Out of
  scope here.
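
For reference, the integer-only idea in the last bullet could look roughly like the sketch below, which derives the day-of-month from an epoch day using the well-known `civil_from_days` arithmetic by Howard Hinnant and truncates to the month without creating a `LocalDate`. It is not part of this PR; the object and method names are hypothetical.

```scala
object CivilSketch {
  // Day of month (1-31) for a proleptic Gregorian epoch day, integer math only.
  def dayOfMonth(epochDay: Long): Int = {
    val z = epochDay + 719468L                 // shift epoch to 0000-03-01
    val era = Math.floorDiv(z, 146097L)        // 400-year eras
    val doe = z - era * 146097L                // day of era, [0, 146096]
    val yoe = (doe - doe / 1460L + doe / 36524L - doe / 146096L) / 365L // year of era
    val doy = doe - (365L * yoe + yoe / 4L - yoe / 100L) // day of March-based year
    val mp = (5L * doy + 2L) / 153L            // March-based month, [0, 11]
    (doy - (153L * mp + 2L) / 5L + 1L).toInt
  }

  // Truncate an epoch day to the first day of its month.
  def truncToMonth(epochDay: Long): Long = epochDay - (dayOfMonth(epochDay) - 1)
}
```

Because the month start is just the input day minus `dayOfMonth - 1`, MONTH truncation does not need the inverse `days_from_civil` conversion; YEAR truncation would.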

h2. Related

* SPARK-56663 - introduced the offset-arithmetic fast path for
  MINUTE / HOUR / DAY; this PR extends the same pattern to the date-level
  units.
* SPARK-33404 - introduced the slow path that this family of changes is
  recovering from.
* SPARK-30766 / SPARK-30857 - the DST-correctness invariants from these
  fixes are preserved here via the offset-equality guard.


> Add fast path for date_trunc WEEK/MONTH/QUARTER/YEAR
> ----------------------------------------------------
>
>                 Key: SPARK-56769
>                 URL: https://issues.apache.org/jira/browse/SPARK-56769
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 5.0.0
>            Reporter: Rito Takeuchi
>            Priority: Major


