iemejia opened a new issue, #56011: URL: https://github.com/apache/spark/issues/56011
## Overview

This is an umbrella issue tracking a series of performance improvements to the Parquet vectorized reader in Spark SQL. The changes target allocation reduction, bulk-read optimizations, and JIT-friendly code patterns across multiple encoding paths.

All PRs are independent and can be reviewed/merged in any order. Together they yield significant throughput gains (1.2x to 7x depending on the encoding and data shape) for Parquet reads with no user-facing behavioral changes.

## Pull Requests

### 1. DELTA_BINARY_PACKED bulk read optimization

**PR:** #55919 ([SPARK-56892](https://issues.apache.org/jira/browse/SPARK-56892))

Replaces per-element lambda dispatch in `readIntegers`/`readLongs` with bulk paths that compute prefix sums in-place and write via `putInts`/`putLongs`. Also eliminates 3 allocations per value in `readUnsignedLongs` by replacing `BigInteger(Long.toUnsignedString(v))` with a reusable `ByteBuffer`.

| Type | Speedup |
|------|---------|
| INT32 (monotonic) | 1.4x |
| INT64 (monotonic) | 3.8x |
| readUnsignedLongs | 7.2x |

---

### 2. Dictionary decoding hasNull fast path + per-class updater overrides

**PR:** #55920 ([SPARK-56893](https://issues.apache.org/jira/browse/SPARK-56893))

Adds a `hasNull()` fast path that skips per-element null checks when the column has no nulls (common case). Per-class `decodeDictionaryIds` overrides give C2 monomorphic call sites, enabling full inlining of type-specific decode expressions.

| Scenario | Speedup |
|----------|---------|
| No nulls (avg across 6 updaters) | 1.24x |

---

### 3. Vectorized BYTE_STREAM_SPLIT reader

**PR:** #55921 ([SPARK-56894](https://issues.apache.org/jira/browse/SPARK-56894))

Adds a new `VectorizedByteStreamSplitValuesReader` that decodes BSS-encoded pages (FLOAT, DOUBLE, INT32, INT64, FIXED_LEN_BYTE_ARRAY) using batch byte-gathering instead of falling back to parquet-mr per-value reads.

| Type | Speedup vs parquet-mr |
|------|-----------------------|
| INT32 | 4.5x |
| INT64 | 2.8x |
| FLOAT | 4.2x |
| DOUBLE | 2.8x |

---

### 4. Batch ByteBuffer slice in RLE PACKED decode

**PR:** #55922 ([SPARK-56895](https://issues.apache.org/jira/browse/SPARK-56895))

Replaces per-group `in.slice(bitWidth)` (one `ByteBuffer` allocation per 8 values) with a single bulk slice for the entire PACKED run. Eliminates ~128K short-lived ByteBuffer allocations per 1M-value page.

| bitWidth | Speedup (readIntegers) |
|----------|------------------------|
| 4 | 2.1x |
| 8 | 2.4x |
| 12 | 1.6x |
| 20 | 1.4x |

---

### 5. Bulk read paths for timestamp/date Parquet vector updaters

**PR:** #55923 ([SPARK-56896](https://issues.apache.org/jira/browse/SPARK-56896))

Replaces per-element `readValue` loops with two-pass bulk read + in-place conversion for five updaters (`LongAsMicrosUpdater`, `LongAsNanosUpdater`, `LongAsMicrosRebaseUpdater`, `DateToTimestampNTZUpdater`, `DateToTimestampNTZWithRebaseUpdater`). Avoids per-element virtual dispatch through `VectorizedValuesReader`.

| Updater | Speedup |
|---------|---------|
| LongAsMicrosUpdater | 2.9x |
| DateToTimestampNTZUpdater | 1.2x |

---

### 6. Reduce per-value allocations in DELTA_BYTE_ARRAY decoder

**PR:** #55924 ([SPARK-56897](https://issues.apache.org/jira/browse/SPARK-56897))

Replaces `ByteBuffer`-based state tracking with a reusable `byte[]` buffer, eliminating 2 ByteBuffer allocations per decoded value (~8K objects per 4096-value page). Also rewrites `skipBinary` to avoid column vector reset/swap overhead.

| Operation | Speedup |
|-----------|---------|
| readBinary | 1.1-1.3x |
| skipBinary | 1.5-1.9x |

---

### 7. Reduce per-value allocation in DELTA_LENGTH_BYTE_ARRAY decoder

**PR:** #55932 ([SPARK-56907](https://issues.apache.org/jira/browse/SPARK-56907))

Replaces per-value `in.slice(length)` with a single bulk slice for the entire batch. Replaces per-value skip loop with a single bulk skip.

| Operation | Speedup |
|-----------|---------|
| readBinary (small payloads) | 1.2x |
| skipBinary | 1.4x |

---

## Common Themes

- **Allocation reduction**: Replace per-value `ByteBuffer.slice()` / `ByteBuffer.wrap()` with bulk reads into reusable buffers
- **Bulk vectorized reads**: Replace per-element virtual dispatch with single batch calls backed by `System.arraycopy`
- **JIT-friendly patterns**: Per-class method overrides for monomorphic call sites; avoiding megamorphic profile pollution from shared helpers

## Benchmarking

All benchmarks were run on AMD EPYC 9V45 with OpenJDK 17/25, comparing upstream `master` against the patched version on the same machine with identical JVM flags.
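
To make the "prefix sums in-place, then one bulk write" idea from item 1 concrete, here is a minimal sketch. It is not the code from PR #55919: the class `DeltaPrefixSumSketch`, the method `decodeBlock`, and the local `putLongs` helper (a plain-array stand-in for a column vector's bulk write) are all hypothetical names chosen for illustration. It assumes the deltas of one block have already been bit-unpacked into a `long[]`.

```java
import java.util.Arrays;

public class DeltaPrefixSumSketch {

    /** Hypothetical stand-in for a column vector's bulk write; here it is just an array copy. */
    public static void putLongs(long[] dest, int destOffset, long[] src, int count) {
        System.arraycopy(src, 0, dest, destOffset, count);
    }

    /**
     * Turns the unpacked deltas of one block into absolute values in place
     * (running prefix sum), then writes them with a single bulk call instead
     * of one callback invocation per value. Returns the last decoded value so
     * the caller can carry it into the next block.
     */
    public static long decodeBlock(long[] unpackedDeltas, int count, long minDelta,
                                   long previousValue, long[] dest, int destOffset) {
        long running = previousValue;
        for (int i = 0; i < count; i++) {
            running += minDelta + unpackedDeltas[i]; // actual delta = minDelta + stored delta
            unpackedDeltas[i] = running;             // overwrite the delta with the absolute value
        }
        putLongs(dest, destOffset, unpackedDeltas, count);
        return running;
    }

    public static void main(String[] args) {
        long[] dest = new long[5];
        dest[0] = 100;                       // first value of the page, stored separately
        long[] unpacked = {0, 1, 0, 2};      // stored deltas, relative to minDelta = 1
        long last = decodeBlock(unpacked, 4, 1L, 100L, dest, 1);
        // prints "[100, 101, 103, 104, 107] last=107"
        System.out.println(Arrays.toString(dest) + " last=" + last);
    }
}
```

The loop body contains no virtual or lambda call, which is the property the issue credits for the speedup; the real reader additionally has to handle mini-block boundaries and bit widths, which this sketch leaves out.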
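
The "batch byte-gathering" mentioned for the BYTE_STREAM_SPLIT reader in item 3 can be illustrated with a small sketch for FLOAT values. It follows the layout in the Parquet format specification (byte k of every value is stored contiguously in stream k, streams are concatenated, bytes are little-endian). The class and method names are hypothetical, and this is not the implementation of `VectorizedByteStreamSplitValuesReader`; the encoder is included only so the example can round-trip.

```java
import java.util.Arrays;

public class ByteStreamSplitSketch {

    /** Encodes floats into BYTE_STREAM_SPLIT layout: 4 streams of n bytes each. */
    public static byte[] encodeFloats(float[] values) {
        int n = values.length;
        byte[] page = new byte[4 * n];
        for (int i = 0; i < n; i++) {
            int bits = Float.floatToRawIntBits(values[i]);
            for (int k = 0; k < 4; k++) {
                page[k * n + i] = (byte) (bits >>> (8 * k)); // byte k goes to stream k
            }
        }
        return page;
    }

    /** Decodes a whole batch by gathering one byte from each of the 4 streams per value. */
    public static float[] decodeFloats(byte[] page, int count) {
        float[] out = new float[count];
        for (int i = 0; i < count; i++) {
            int bits = (page[i] & 0xFF)
                    | ((page[count + i] & 0xFF) << 8)
                    | ((page[2 * count + i] & 0xFF) << 16)
                    | ((page[3 * count + i] & 0xFF) << 24);
            out[i] = Float.intBitsToFloat(bits);
        }
        return out;
    }

    public static void main(String[] args) {
        float[] values = {1.5f, -2.0f, 0.0f};
        float[] decoded = decodeFloats(encodeFloats(values), values.length);
        System.out.println(Arrays.toString(decoded));
    }
}
```

Decoding the batch in one tight loop over a `byte[]` keeps the work in a single method the JIT can optimize, in contrast to calling a per-value reader once for each element.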
