[PR] parquet: speed up ByteView dictionary decoder with chunks_exact gather (~28%) [arrow-rs]

via GitHub Thu, 16 Apr 2026 11:10:47 -0700


Dandandan opened a new pull request, #9745:
URL: https://github.com/apache/arrow-rs/pull/9745


   ## Which issue does this PR close?
   
   None yet — targeted optimisation surfaced by profiling `profile_clickbench` 
locally.
   
   ## Rationale for this change
   
   `ByteViewArrayDecoderDictionary::read` was the hot inner loop for reading
   dictionary-encoded `StringView` / `BinaryView` columns. Its existing shape 
was:
   
   ```rust
   output.views.extend(keys.iter().map(|k| match dict.views.get(*k as usize) {
       Some(&view) => view,
       None => {
           if error.is_none() {
               error = Some(general_err!(\"invalid key={} for dictionary\", 
*k));
           }
           0
       }
   }));
   ```
   
   Per element this pays:
   - a bounds-checked `get`,
   - a `Some`/`None` branch,
   - an `error.is_none()` branch (even on the happy path),
   - `Vec::extend`'s per-push capacity check — `Map<_, closure>` doesn't get 
the `TrustedLen` specialisation.
   
   ## What changes are included in this PR?
   
   Rewrite the inner loop with a `chunks_exact(8)` gather: bulk-validate each
   chunk's keys with a single `all(|&k| (k as usize) < dict_len)` (which the
   compiler auto-vectorises), then use `get_unchecked` and raw-pointer writes.
   Mirrors the pattern already used in `RleDecoder::get_batch_with_dict`.
   
   Invalid keys now return an error eagerly via a `#[cold]` helper instead of
   zero-filling and deferring the error to end of batch.
   
   ## Are these changes tested?
   
   Existing unit tests in `byte_view_array` pass 
(`test_byte_array_string_view_decoder`,
   `test_byte_view_array_plain_decoder_reuse_buffer`).
   
   Microbenchmarks (`parquet/benches/arrow_reader.rs`, 
`arrow_array_reader/(String|Binary)ViewArray/dictionary *`):
   
   | Bench | Before | After | Δ |
   |---|---|---|---|
   | BinaryView mandatory, no NULLs | 102.91 µs | 74.29 µs | **−27.8%** |
   | BinaryView optional, no NULLs | 104.63 µs | 76.65 µs | **−26.9%** |
   | BinaryView optional, half NULLs | 143.25 µs | 132.46 µs | −7.3% |
   | StringView mandatory, no NULLs | 105.98 µs | 73.87 µs | **−28.8%** |
   | StringView optional, no NULLs | 104.62 µs | 76.34 µs | **−27.4%** |
   | StringView optional, half NULLs | 141.86 µs | 131.85 µs | −7.1% |
   
   Half-NULL cases gain less because roughly half the views end up being null 
padding rather than gather output.
   
   ## Are there any user-facing changes?
   
   None — same public API, same semantics (invalid indices still surface as
   `ParquetError::General`).
   
   🤖 Generated with [Claude Code](https://claude.com/claude-code)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

[PR] parquet: speed up ByteView dictionary decoder with chunks_exact gather (~28%) [arrow-rs]

Reply via email to