viirya opened a new pull request, #56072:
URL: https://github.com/apache/spark/pull/56072

   ### What changes were proposed in this pull request?
   
   `VectorizedRleValuesReader` materializes RLE runs of nulls and
   definition levels with degenerate per-element loops:
   
   ```java
   // VectorizedRleValuesReader.java
   for (int k = 0; k < runLen; k++) {
     nulls.putNull(valueOff + k);
   }
   for (int k = 0; k < runLen; k++) {
     defLevels.putInt(levelIdx + k, runValue);
   }
   ```
   
   `WritableColumnVector` already exposes the bulk equivalents
   `putNulls(rowId, count)` and `putInts(rowId, count, value)`. This PR
   switches the three call sites to the bulk APIs and reimplements the
   bulk APIs themselves (which were also per-element loops) on top of
   intrinsic-backed fills:
   
   - `OnHeapColumnVector.putNulls` -> `Arrays.fill(byte[], ..., (byte) 1)`
   - `OnHeapColumnVector.putInts(rowId, count, value)` ->
     `Arrays.fill(int[], ..., value)`
   - `OffHeapColumnVector.putNulls` ->
     `Platform.setMemory(addr, (byte) 1, count)`
   
   `Arrays.fill` is backed by HotSpot's `_jbyte_fill` / `_jint_fill`
   intrinsic stubs and `Unsafe.setMemory` lowers to a native memset; both
   are faster than the byte/int loops they replace once `runLen` grows
   beyond a handful of elements.
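
   The loop-to-fill substitution described above can be sketched in
   isolation. This is a minimal illustration, not the PR's code:
   `BulkFillSketch` and its methods are hypothetical stand-ins for an
   on-heap vector backed by a `byte[]` null mask and an `int[]` data
   array, and bookkeeping a real vector would need (null counts, capacity
   checks) is omitted.

   ```java
   import java.util.Arrays;

   // Hypothetical stand-in for an on-heap column vector; not Spark's class.
   class BulkFillSketch {
     final byte[] nulls;   // 1 marks a null slot, 0 a non-null slot
     final int[] intData;

     BulkFillSketch(int capacity) {
       nulls = new byte[capacity];
       intData = new int[capacity];
     }

     // Per-element versions, in the shape of the loops being replaced.
     void putNullsLoop(int rowId, int count) {
       for (int i = 0; i < count; i++) {
         nulls[rowId + i] = (byte) 1;
       }
     }

     void putIntsLoop(int rowId, int count, int value) {
       for (int i = 0; i < count; i++) {
         intData[rowId + i] = value;
       }
     }

     // Bulk versions: Arrays.fill takes a half-open range [fromIndex, toIndex).
     void putNullsBulk(int rowId, int count) {
       Arrays.fill(nulls, rowId, rowId + count, (byte) 1);
     }

     void putIntsBulk(int rowId, int count, int value) {
       Arrays.fill(intData, rowId, rowId + count, value);
     }

     // Returns true when both strategies leave identical array contents.
     static boolean sameResult(int capacity, int rowId, int count, int value) {
       BulkFillSketch a = new BulkFillSketch(capacity);
       BulkFillSketch b = new BulkFillSketch(capacity);
       a.putNullsLoop(rowId, count);
       a.putIntsLoop(rowId, count, value);
       b.putNullsBulk(rowId, count);
       b.putIntsBulk(rowId, count, value);
       return Arrays.equals(a.nulls, b.nulls)
           && Arrays.equals(a.intData, b.intData);
     }
   }
   ```

   The only subtlety in the conversion is the range convention: the loops
   take a start and a count, while `Arrays.fill` takes a start and an
   exclusive end, so the end index is `rowId + count`.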
   
   ### Why are the changes needed?
   
   The bulk-fill APIs on `WritableColumnVector` were the obviously correct
   calls to make in `VectorizedRleValuesReader`, but they were bulk in name
   only: both the callers and the implementations were small per-element
   loops.
   
   Measured on Apple M4 Max + OpenJDK 21.0.8 using
   `VectorizedRleValuesReaderBenchmark` (Group C, "Nullable batch decode
   with def-level materialization", 1M rows, BATCH_SIZE=4096). Times are
   in ns/row (lower is better); a positive delta means the patched build
   is faster:
   
   | nullRatio | shape     | baseline | patched | delta  |
   | --------- | --------- | -------: | ------: | -----: |
   | 0.1       | random    | 4.0      | 4.2     | noise  |
   | 0.1       | clustered | 2.8      | 2.7     | +4%    |
   | 0.3       | random    | 6.2      | 6.3     | noise  |
   | 0.3       | clustered | 2.8      | 2.7     | +4%    |
   | 0.5       | random    | 7.1      | 7.1     | 0%     |
   | 0.5       | clustered | 2.8      | 2.6     | +7%    |
   | 0.9       | random    | 3.9      | 3.5     | +10%   |
   | 0.9       | clustered | 2.6      | 2.3     | +12%   |
   
   Gains concentrate on clustered null patterns (long RLE runs), which are
   common in real workloads: sparse dimension columns, ETL-staged nulls,
   time-bucketed missing metrics. Random null patterns mostly produce short
   runs where the bulk-API call cost matches the original loop, hence the
   noise-level deltas there. The exception is nullRatio 0.9, where the
   random shape also improves by about 10%, presumably because nulls are
   dense enough to form longer runs.
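
   To make the run-length argument concrete, here is a small sketch that
   measures the mean length of consecutive-null runs in a null mask.
   `NullRunSketch` is a hypothetical helper, not part of Spark or the
   benchmark, and it only approximates what the reader sees, since the
   actual RLE runs depend on how the definition levels were encoded.

   ```java
   // Hypothetical helper for illustration; not part of Spark or the benchmark.
   class NullRunSketch {
     // Mean length of maximal runs of `true` (null) in the mask;
     // returns 0.0 when the mask contains no nulls.
     static double meanNullRunLength(boolean[] isNull) {
       int runs = 0;
       int nullCount = 0;
       boolean inRun = false;
       for (boolean b : isNull) {
         if (b) {
           nullCount++;
           if (!inRun) {
             runs++;        // a new run starts here
             inRun = true;
           }
         } else {
           inRun = false;   // a non-null value ends the current run
         }
       }
       return runs == 0 ? 0.0 : (double) nullCount / runs;
     }
   }
   ```

   With the same null ratio of 0.5, an alternating mask
   `{true, false, true, false, true, false, true, false}` has a mean run
   length of 1, while a clustered mask
   `{true, true, true, true, false, false, false, false}` has a mean run
   length of 4. Only the second shape gives a bulk fill enough elements to
   amortize its call cost.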
   
   ### Does this PR introduce _any_ user-facing change?
   
   No.
   
   ### How was this patch tested?
   
   Existing tests; no behavior change. Ran locally:
   
   - `VectorizedRleValuesReaderSuite` (covers the modified caller paths)
   - `ColumnVectorSuite` and `ColumnarBatchSuite` (cover the modified
     `OnHeap/OffHeapColumnVector.putNulls` / `putInts` bulk APIs)
   - `ParquetIOSuite` (end-to-end vectorized reader coverage)
   
   237 tests, all pass.
   
   Benchmark numbers above produced by the existing
   `VectorizedRleValuesReaderBenchmark` (no benchmark changes in this PR).
   
   ### Was this patch authored or co-authored using generative AI tooling?
   
   Generated-by: Claude Code (Claude Opus 4.7)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

