adriangb opened a new pull request, #9511:
URL: https://github.com/apache/arrow-rs/pull/9511
## Motivation
When working with `BooleanArray`, a common pattern is checking whether *any*
true or false value exists — e.g.
`arr.true_count() > 0` or `arr.false_count() == 0`. This currently requires
`true_count()` / `false_count()`, which scan the **entire** bitmap to count
every set bit (via `popcount`), even though we only need to know if at least
one exists.
This PR adds `has_true()` and `has_false()` methods that short-circuit as
soon as they find a matching value, providing both:
1. **Better performance** — faster on large arrays in the best case
2. **More ergonomic API** — `arr.has_true()` expresses intent more clearly
than `arr.true_count() > 0`
## Callsites in DataFusion
There are several places in DataFusion that would benefit from these methods:
- **`datafusion/functions-nested/src/array_has.rs`** —
`eq_array.true_count() > 0` → `eq_array.has_true()`
- **`datafusion/physical-plan/src/topk/mod.rs`** — `filter.true_count() ==
0` check → `!filter.has_true()`
- **`datafusion/datasource-parquet/src/metadata.rs`** —
`exactness.true_count() == 0` and `combined_mask.true_count() > 0`
- **`datafusion/physical-plan/src/joins/nested_loop_join.rs`** —
`bitmap.true_count() == 0` checks
- **`datafusion/physical-expr-common/src/physical_expr.rs`** —
`selection_count == 0` from `selection.true_count()`
- **`datafusion/physical-expr/src/expressions/binary.rs`** — short-circuit
checks for AND/OR
## Benchmark Results
```
Scenario true_count has_true has_false
Speedup (best)
─────────────────────────────────────────────────────────────────────────────────────────────
all_true, 64 4.32 ns 4.08 ns 4.76 ns
~1.1x
all_false, 64 4.30 ns 4.25 ns 4.52 ns
~1.0x
all_true, 1024 5.15 ns 4.52 ns 4.99 ns
~1.1x
all_false, 1024 5.17 ns 4.55 ns 5.00 ns
~1.1x
mixed_early, 1024 5.22 ns — 5.04 ns
~1.0x
nulls_all_true, 1024 12.84 ns 4.10 ns 12.92 ns
~3.1x
all_true, 65536 100.06 ns 5.96 ns 49.70 ns
~16.8x (has_true)
all_false, 65536 99.33 ns 49.30 ns 6.19 ns
~16.0x (has_false)
mixed_early, 65536 100.10 ns — 6.20 ns
~16.1x (has_false)
nulls_all_true, 65536 522.94 ns 4.05 ns 521.82 ns
~129x (has_true)
```
The key wins are on larger arrays (65,536 elements), where
`has_true`/`has_false` are **up to 16-129x faster** than
`true_count()` in best-case scenarios (early short-circuit). Even in worst
case (must scan entire array), performance is
comparable to `true_count`.
## Implementation
The implementation processes bits in 64-bit chunks using
`UnalignedBitChunk`, which handles arbitrary bit offsets and aligns
data for SIMD-friendly processing.
- **`has_true` (no nulls):** OR-folds 64-bit chunks, short-circuits when any
bit is set
- **`has_false` (no nulls):** AND-folds 64-bit chunks, short-circuits when
any bit is unset (with padding bits masked to 1)
- **With nulls:** Iterates paired `(null, value)` chunks, checking `null &
value != 0` (has_true) or `null & !value != 0`
(has_false)
### Alternatives considered
1. **Fully vectorized (no early stopping):** Would process the entire bitmap
like `true_count()` but with simpler bitwise ops
instead of popcount. Marginally faster than `true_count()` but misses the
main optimization opportunity.
2. **Per-element iteration with early stopping:** `self.iter().any(|v| v ==
Some(true))`. Simple but processes one bit at a
time, missing SIMD vectorization of the inner loop. Our approach processes
64 bits at a time while still supporting early
exit.
The chosen approach balances SIMD-friendly bulk processing (64 bits per
iteration) with early termination, giving the best of
both worlds.
## Test Plan
- Unit tests covering: all-true, all-false, mixed, empty, nullable
(all-valid-true, all-valid-false, all-null), non-aligned
lengths (65 elements, 64+1 with trailing false)
- Criterion benchmarks comparing `has_true`/`has_false` vs `true_count`
across sizes (64, 1024, 65536) and data distributions
🤖 Generated with [Claude Code](https://claude.com/claude-code
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]