iemejia opened a new pull request, #55923: URL: https://github.com/apache/spark/pull/55923
### What changes were proposed in this pull request? Replace per-element `readValue` loops with two-pass bulk read + in-place conversion for five `ParquetVectorUpdater` implementations in `ParquetVectorUpdaterFactory`: | Updater | Bulk read | In-place transform | |---|---|---| | `LongAsMicrosUpdater` | `readLongs` | `millisToMicros` | | `LongAsNanosUpdater` | `readLongs` | `microsToNanos` | | `LongAsMicrosRebaseUpdater` | `readLongs` | `millisToMicros` + `rebaseMicros` | | `DateToTimestampNTZUpdater` | `readIntegersAsLongs` | `daysToMicros` | | `DateToTimestampNTZWithRebaseUpdater` | `readIntegersAsLongs` | `rebaseDays` + `daysToMicros` | Each updater now: 1. Bulk-reads raw values into the column vector via `readLongs` or `readIntegersAsLongs` (backed by `System.arraycopy`). 2. Applies the conversion in a tight in-place loop over the column vector. This avoids per-element virtual dispatch through `VectorizedValuesReader` in the hot loop. The `getLong`/`putLong` calls on `final OnHeapColumnVector` are devirtualized by C2 into direct array access. Note: extracting a shared helper taking `LongUnaryOperator` was attempted and reverted because it caused a 4x regression on `LongAsMicrosRebaseUpdater` (1791 -> 434 M/s). The root cause is C2 profile pollution: multiple updaters calling the same static helper with different lambdas makes the `applyAsLong` call site megamorphic, preventing lambda inlining. The explicit inline code is necessary for this hot path. Also adds three missing benchmark cases to `ParquetVectorUpdaterBenchmark`: `LongAsNanosUpdater`, `DateToTimestampNTZWithRebaseUpdater`, `LongAsMicrosRebaseUpdater`. ### Why are the changes needed? The per-element `readValue` loop issues a virtual call to `VectorizedValuesReader.readLong()` / `readInteger()` on every row, which C2 cannot always devirtualize because the reader type varies (PLAIN, RLE, DELTA, etc.). The two-pass approach replaces N virtual calls with a single bulk read (already optimized per reader implementation) followed by a tight scalar loop that C2 can fully inline and optimize. Before/after on the same machine (AMD EPYC 9V45, JDK 17, 1M rows): | Updater | Before (M/s) | After (M/s) | Speedup | |---|---|---|---| | `LongAsMicrosUpdater` | 767 | 2,216 | **2.9x** | | `LongAsNanosUpdater` | (new) | 2,255 | -- | | `DateToTimestampNTZUpdater` | 47 | 56 | **1.2x** | | `DateToTimestampNTZWithRebaseUpdater` | (new) | 55 | -- | | `LongAsMicrosRebaseUpdater` | (new) | 1,791 | -- | The `DateToTimestampNTZ*` updaters show a modest improvement because `daysToMicros` date arithmetic dominates. The long-based updaters show large gains because the transforms (`millisToMicros` = `Math.multiplyExact(x, 1000)`, `microsToNanos` = same) are trivial and the virtual dispatch overhead was the bottleneck. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? - Existing test suites: `ParquetVectorUpdaterFactorySuite`, `ParquetQuerySuite`, `ParquetIOSuite`, `ParquetSchemaSuite`, `ParquetRebaseDatetimeSuite`, `ParquetEncodingSuite`, `ParquetInteroperabilitySuite`, `ParquetTypeWideningSuite` -- 317 tests pass, 0 failures. - Benchmark: `ParquetVectorUpdaterBenchmark` with three new cases. ### Was this patch authored or co-authored using generative AI tooling? Generated-by: OpenCode with Claude claude-opus-4.6 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
