Dandandan opened a new pull request, #9746:
URL: https://github.com/apache/arrow-rs/pull/9746

   ## Summary
   
   - Replace the short-circuiting `idx_chunk.iter().all(|&i| (i as usize) < dict_len)` in the bit-packed hot loop of `RleDecoder::get_batch_with_dict` with a u32 max-reduction. The early exit in `.all` blocks autovectorisation; `fold(0u32, |acc, &i| acc.max(i as u32))` has no early exit, so LLVM lowers the check to a single SIMD max-reduction and reuses the loaded registers for the gather that follows.
   - Add `parquet/benches/rle_dict.rs`, a small targeted Criterion bench that drives `get_batch_with_dict` directly (i32 and `String` dictionaries, dictionary sizes 16/256/1024, 8192 values per batch).
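
For reviewers, here is a minimal standalone sketch of the two forms of the check. It is not the code in `get_batch_with_dict`: the function names are hypothetical, and the chunk is assumed to hold `i32` indices. Under the further assumption that the dictionary has at most 2^31 entries, the two forms agree for any non-empty chunk.

```rust
/// Hypothetical helper: the short-circuiting form.
/// `all` returns as soon as one index fails, which gives the loop an
/// early exit.
fn in_bounds_short_circuit(idx_chunk: &[i32], dict_len: usize) -> bool {
    idx_chunk.iter().all(|&i| (i as usize) < dict_len)
}

/// Hypothetical helper: the max-reduction form.
/// Every index is visited unconditionally, and a single comparison
/// against `dict_len` happens after the reduction.
fn in_bounds_max_reduce(idx_chunk: &[i32], dict_len: usize) -> bool {
    let max_idx = idx_chunk.iter().fold(0u32, |acc, &i| acc.max(i as u32));
    (max_idx as usize) < dict_len
}
```

With `dict_len = 16`, both functions accept a chunk whose largest index is 15 and reject a chunk that contains 16.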
   
   ## Why
   
   On aarch64 the old code compiled to eight serialised `ldrsw` + `cmp` + `b.ls` sequences per 8-index chunk, followed by eight separate scalar gather loads, one lane at a time. After the change the bounds check is one SIMD reduction:
   
   ```
   ldp     q1, q0, [x11], #0x20    ; load 8 indices
   umax.4s v2, v1, v0              ; lane-wise max
   umaxv.4s s2, v2                 ; horizontal max
   fmov    w13, s2
   cmp     x20, x13                ; one bounds check
   b.ls    <panic>
   ```
   
   and `v1` and `v0` are then reused for the gather, avoiding the reloads.
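
As an illustration only, the reduction in the listing can be modelled in scalar Rust. `max_of_eight` is a hypothetical name, and the compiler output above was not produced from code written this way; the sketch just shows what the lane-wise and horizontal max steps compute for one 8-index chunk.

```rust
/// Hypothetical scalar model of the two-step SIMD reduction.
fn max_of_eight(indices: [u32; 8]) -> u32 {
    // Lane-wise max of the two 4-lane halves (models `umax.4s v2, v1, v0`).
    let mut v2 = [0u32; 4];
    for lane in 0..4 {
        v2[lane] = indices[lane].max(indices[lane + 4]);
    }
    // Horizontal max across the four lanes (models `umaxv.4s s2, v2`).
    v2.iter().copied().fold(0, u32::max)
}
```

The result is the largest of the eight indices, so one comparison against the dictionary length covers the whole chunk.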
   
   Negative `i32` values cast to `u32` become values of 2^31 or more, beyond any index a dictionary addressed by `i32` can hold, so the check still rejects them.
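
A small sketch of the cast behaviour, assuming `i32` indices; `index_as_u32` and `max_index` are hypothetical names used only for this illustration.

```rust
/// Hypothetical helper: reinterpret a signed index the way `i as u32`
/// does. The bit pattern is kept, so negative values map to the upper
/// half of the `u32` range.
fn index_as_u32(i: i32) -> u32 {
    i as u32
}

/// Hypothetical helper: max-reduce a chunk after the cast. A single
/// negative index dominates the maximum.
fn max_index(idx_chunk: &[i32]) -> u32 {
    idx_chunk.iter().fold(0u32, |acc, &i| acc.max(index_as_u32(i)))
}
```

Here `index_as_u32(-1)` equals `u32::MAX` and `index_as_u32(i32::MIN)` equals 2^31, so a chunk such as `[0, 5, -1, 3]` reduces to `u32::MAX` and fails the comparison against a dictionary of 1024 entries.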
   
   ## Measurements
   
   Apple Silicon (aarch64), `cargo bench --bench rle_dict`:
   
   | case                | before      | after       | Δ     |
   |---------------------|-------------|-------------|-------|
   | str/dict=16         | 59.48 µs    | 57.90 µs    | −2.6% |
   | i32/dict=16         | 3.28 µs     | 3.34 µs     | noise |
   | str/dict=256        | 48.72 µs    | 47.96 µs    | noise |
   | i32/dict=256        | 3.33 µs     | 3.21 µs     | −3.3% |
   | str/dict=1024       | 34.29 µs    | 33.01 µs    | −4.2% |
   | i32/dict=1024       | 3.79 µs     | 3.74 µs     | −1.7% |
   
   ## Test plan
   
   - [x] `cargo test -p parquet --lib -- encodings::rle::`
   - [x] `cargo bench -p parquet --bench rle_dict --features experimental` (results above)
   - [ ] Verify CI
   
   🤖 Generated with [Claude Code](https://claude.com/claude-code)

