adriangb opened a new pull request, #9510:
URL: https://github.com/apache/arrow-rs/pull/9510

   ## Which issue does this PR close?
   
   Follow-up optimization to #7748 which added `has_true()`/`has_false()` to 
`BooleanArray`.
   
   ## Rationale
   
   The initial implementation used `BitChunks::iter_padded()` which yields one 
`u64` at a time through an opaque chained iterator that prevents LLVM 
auto-vectorization. Benchmarks showed the worst case (full scan, 65K elements) 
at ~255ns vs ~100ns for `true_count()` which uses `UnalignedBitChunk` — an 
aligned `&[u64]` slice that LLVM vectorizes.
   
   ## What changes are included in this PR?
   
   Switch the **no-nulls** paths of `has_true()` and `has_false()` from 
`BitChunks::iter_padded()` to `UnalignedBitChunk`. Process the aligned 
`chunks()` slice in blocks of 64 u64s (4096 bits), using a fold within each 
block that LLVM can auto-vectorize, and short-circuiting between blocks.
   
   - `has_true()`: OR-fold (`acc | c`) per block, short-circuit when `!= 0`
   - `has_false()`: AND-fold (`acc & c`) per block, short-circuit when `!= 
u64::MAX`
   
   The **with-nulls** paths are unchanged — the two buffers may have different 
alignment making a shared `&[u64]` zip difficult.
   
   ## Benchmark results (65536 elements)
   
   | Benchmark | Before | After | Speedup |
   |---|---|---|---|
   | `has_true(all_false)` — worst case full scan | ~255ns | **49ns** | 
**5.2x** |
   | `has_false(all_true)` — worst case full scan | ~255ns | **50ns** | 
**5.1x** |
   | `has_true(all_true)` — short-circuit | ~6ns | 6ns | — |
   | `has_false(all_false)` — short-circuit | ~6ns | 6ns | — |
   | `true_count` (for reference) | ~101ns | 101ns | — |
   
   The worst-case full scan is now ~2x faster than `true_count()` because the 
fold uses a simpler per-element operation (OR/AND vs popcount+sum).
   
   ## Potential DataFusion callsites
   
   These methods are new and not yet adopted in DataFusion, but several 
existing patterns could benefit:
   
   - `datafusion/functions-nested/src/array_has.rs:345` — 
`eq_array.true_count() > 0` → `eq_array.has_true()`
   - `datafusion/datasource-parquet/src/metadata.rs:615` — 
`combined_mask.true_count() > 0` → `combined_mask.has_true()`
   - `datafusion/physical-plan/src/joins/nested_loop_join.rs:1663,2250` — 
`bitmap.true_count() == 0` → `!bitmap.has_true()`
   - `datafusion/functions-nested/src/replace.rs:372` — `eq_array.false_count() 
== eq_array.len()` → `!eq_array.has_true()`
   - `datafusion/datasource-parquet/src/metadata.rs:495,514` — 
`exactness.true_count() == exactness.len()` → `!exactness.has_false()`
   - `datafusion/datasource-parquet/src/metadata.rs:498,517` — 
`exactness.true_count() == 0` → `!exactness.has_true()`
   
   ## Are these changes tested?
   
   Yes — all existing tests pass (`cargo test -p arrow-array -- 
boolean_array`), including non-aligned and edge case tests added in #7748.
   
   🤖 Generated with [Claude Code](https://claude.com/claude-code)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to