joe-ucp opened a new pull request, #8869:
URL: https://github.com/apache/arrow-rs/pull/8869
Here’s a version that fits their template and avoids mentioning Python:
---
# Which issue does this PR close?
* Closes #8806.
# Rationale for this change
The bitwise operations used across `arrow-buffer`, `arrow-arith`, and
`arrow-select` were spread across several overlapping helpers, which made it
difficult to:
* Know which APIs were the “canonical” bitwise primitives, and
* Reason about and fix bugs involving offsets and null bitmaps, especially
in `nullif`.
While working on bitwise kernels, we also found that `nullif` in
`arrow-select` was fragile around sliced arrays, offsets, and null-count
handling. This PR centralizes the bitwise logic and makes `nullif` explicitly
respect the `ArrayData` bitmap layout invariants (length, offset, validity
mapping).
# What changes are included in this PR?
High-level:
* **Documented the bitmap layout contract**
* Added a documentation file (`docs/arraydata_bitmap_layout_contract.md`)
that spells out:
* How `len` and `offset` relate to validity bits,
* How new arrays should be constructed (`offset = 0`),
* And the `nullif` validity rule: `result_valid(i) = V(i) & !C(i)`.
* **Strengthened core bitwise testing in `arrow-buffer` / `arrow-arith`**
* Added/extended tests that:
* Compare new `Buffer`/`MutableBuffer` bitwise APIs against existing
helpers across random data,
* Exercise offsets and tail bits (including `(offset + len)` boundary
cases),
* Check that in-place (`MutableBuffer`) operations produce the same bits
as allocating versions.
* Added `rand` as a **dev-dependency** of `arrow-arith` to support
reproducible randomized tests.
* **Refactored `nullif` to follow the layout contract**
* Introduced a helper:
```rust
fn compute_nullif_validity(
left_valid: &Buffer,
left_offset: usize,
cond_mask: &Buffer,
cond_offset: usize,
len: usize,
) -> Buffer
```
which implements the core `V(i) & !C(i)` validity logic over buffers and
bit offsets.
* Updated `nullif` to:
* Derive correct bit offsets from `ArrayData` for the left validity and
condition mask,
* Always build new result `ArrayData` with `offset = 0`,
* Slice value/offset buffers so element 0 in the result corresponds to
logical index 0 for:
* Primitive arrays,
* Boolean arrays,
* Utf8 / LargeUtf8 arrays,
* Struct arrays (with correct top-level validity propagation).
* **Nullif-focused tests**
* Added a focused unit test for `compute_nullif_validity` that exercises:
* No-offset cases,
* `(offset + len) % 8 == 0` boundaries,
* Cross-byte bit ranges.
* Verified and fixed behavior so that the existing `nullif` tests all
pass, including:
* Offset regressions for ints and booleans,
* String and struct slice tests,
* `null_count` regression tests,
* The existing `nullif_fuzz` test.
Net effect: `nullif` is now a thin consumer of the documented bitmap layout
contract and the consolidated bitwise APIs.
# Are these changes tested?
Yes.
The following were run locally:
```bash
# Core buffer + arithmetic crates
cargo test -p arrow-buffer --lib
cargo test -p arrow-arith --lib
# Select crate, including all nullif tests
cargo test -p arrow-select --lib
cargo test -p arrow-select --lib nullif -- --nocapture
```
All of the above pass, including the previously failing `nullif` regression
and fuzz tests.
# Are there any user-facing changes?
There are **no breaking changes** to public APIs.
User-visible effects:
* `nullif` behavior is now strictly aligned with the documented bitmap
layout contract for:
* Sliced arrays,
* Offsets,
* Null masks,
* Strings and structs.
* This should only be observable as **bug fixes** in corner cases (e.g.,
nulls at specific offsets, struct/string slices), not API changes.
No new configuration or feature flags are introduced; this is primarily a
correctness and maintainability improvement around existing behavior.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]