[PR] [SPARK-56896][SQL] Add bulk read paths for timestamp/date Parquet vector updaters [spark]

via GitHub Sat, 16 May 2026 17:11:28 -0700


iemejia opened a new pull request, #55923:
URL: https://github.com/apache/spark/pull/55923


   ### What changes were proposed in this pull request?
   
   Replace per-element `readValue` loops with two-pass bulk read + in-place 
conversion for five `ParquetVectorUpdater` implementations in 
`ParquetVectorUpdaterFactory`:
   
   | Updater | Bulk read | In-place transform |
   |---|---|---|
   | `LongAsMicrosUpdater` | `readLongs` | `millisToMicros` |
   | `LongAsNanosUpdater` | `readLongs` | `microsToNanos` |
   | `LongAsMicrosRebaseUpdater` | `readLongs` | `millisToMicros` + 
`rebaseMicros` |
   | `DateToTimestampNTZUpdater` | `readIntegersAsLongs` | `daysToMicros` |
   | `DateToTimestampNTZWithRebaseUpdater` | `readIntegersAsLongs` | 
`rebaseDays` + `daysToMicros` |
   
   Each updater now:
   1. Bulk-reads raw values into the column vector via `readLongs` or 
`readIntegersAsLongs` (backed by `System.arraycopy`).
   2. Applies the conversion in a tight in-place loop over the column vector.
   
   This avoids per-element virtual dispatch through `VectorizedValuesReader` in 
the hot loop. The `getLong`/`putLong` calls on `final OnHeapColumnVector` are 
devirtualized by C2 into direct array access.
   
   Note: extracting a shared helper taking `LongUnaryOperator` was attempted 
and reverted because it caused a 4x regression on `LongAsMicrosRebaseUpdater` 
(1791 -> 434 M/s). The root cause is C2 profile pollution: multiple updaters 
calling the same static helper with different lambdas makes the `applyAsLong` 
call site megamorphic, preventing lambda inlining. The explicit inline code is 
necessary for this hot path.
   
   Also adds three missing benchmark cases to `ParquetVectorUpdaterBenchmark`: 
`LongAsNanosUpdater`, `DateToTimestampNTZWithRebaseUpdater`, 
`LongAsMicrosRebaseUpdater`.
   
   ### Why are the changes needed?
   
   The per-element `readValue` loop issues a virtual call to 
`VectorizedValuesReader.readLong()` / `readInteger()` on every row, which C2 
cannot always devirtualize because the reader type varies (PLAIN, RLE, DELTA, 
etc.). The two-pass approach replaces N virtual calls with a single bulk read 
(already optimized per reader implementation) followed by a tight scalar loop 
that C2 can fully inline and optimize.
   
   Before/after on the same machine (AMD EPYC 9V45, JDK 17, 1M rows):
   
   | Updater | Before (M/s) | After (M/s) | Speedup |
   |---|---|---|---|
   | `LongAsMicrosUpdater` | 767 | 2,216 | **2.9x** |
   | `LongAsNanosUpdater` | (new) | 2,255 | -- |
   | `DateToTimestampNTZUpdater` | 47 | 56 | **1.2x** |
   | `DateToTimestampNTZWithRebaseUpdater` | (new) | 55 | -- |
   | `LongAsMicrosRebaseUpdater` | (new) | 1,791 | -- |
   
   The `DateToTimestampNTZ*` updaters show a modest improvement because 
`daysToMicros` date arithmetic dominates. The long-based updaters show large 
gains because the transforms (`millisToMicros` = `Math.multiplyExact(x, 1000)`, 
`microsToNanos` = same) are trivial and the virtual dispatch overhead was the 
bottleneck.
   
   ### Does this PR introduce _any_ user-facing change?
   
   No.
   
   ### How was this patch tested?
   
   - Existing test suites: `ParquetVectorUpdaterFactorySuite`, 
`ParquetQuerySuite`, `ParquetIOSuite`, `ParquetSchemaSuite`, 
`ParquetRebaseDatetimeSuite`, `ParquetEncodingSuite`, 
`ParquetInteroperabilitySuite`, `ParquetTypeWideningSuite` -- 317 tests pass, 0 
failures.
   - Benchmark: `ParquetVectorUpdaterBenchmark` with three new cases.
   
   ### Was this patch authored or co-authored using generative AI tooling?
   
   Generated-by: OpenCode with Claude claude-opus-4.6


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[PR] [SPARK-56896][SQL] Add bulk read paths for timestamp/date Parquet vector updaters [spark]

Reply via email to