This is an automated email from the ASF dual-hosted git repository.
alamb pushed a commit to branch main
in repository https://gitbox.apache.org/repos/asf/arrow-rs.git
The following commit(s) were added to refs/heads/main by this push:
new 73c513afd3 Bulk-fill definition levels for majority-null leaf columns
(#9967)
73c513afd3 is described below
commit 73c513afd36b0c6a3b41ef9dc135aee568184255
Author: RyanStewart <[email protected]>
AuthorDate: Wed May 20 11:49:47 2026 -0700
Bulk-fill definition levels for majority-null leaf columns (#9967)
## Which issue does this PR close?
- Contributes to #9731.
## AI assistance
Implementation drafted with AI assistance and iterated against the
benchmarks below. I've reviewed and own the code, including the gate
threshold which I picked from the sweep in [Threshold
(`BULK_FILL_MIN_LEN`)](#threshold-bulk_fill_min_len). Per the project's
[CONTRIBUTING guidance on AI-generated
submissions](https://github.com/apache/arrow-rs/blob/main/CONTRIBUTING.md#ai-generated-submissions).
## Rationale for this change
When writing a nullable leaf (primitive) Arrow array, `write_leaf`
builds the definition-level buffer one element at a time, mapping each
null bit to a level. For columns that are mostly null this does
~`num_rows` of branchy work and allocates a `num_rows`-element level
buffer even though almost every produced level is the same value. #9954
adds an O(1) fast path for the *entirely* null case; this PR covers the
*sparse* (mostly-but-not-entirely null) case it doesn't handle, the
literal subject of #9731 ("a column that is 99% null … ~100x more work
than necessary").
## What changes are included in this PR?
A single popcount pass over the null mask
(`Buffer::count_set_bits_offset`, O(`num_rows`/64)) counts the valid
values in the range. When the slice is majority-null, the
definition-level buffer is bulk-filled with the null level (a vectorized
`Vec::resize` memset) and only the non-null positions (from
`NullBuffer::valid_indices()`) are overwritten. The existing per-row
path is kept for non-majority-null slices, so balanced and null-light
columns are unaffected. Both branches share the same `let range_nulls =
nulls.slice(range.start, len)` slicing idiom; the slow path uses
`range_nulls.iter()` for the def-level map and
`range_nulls.valid_indices().map(|i| i + range.start)` for
`non_null_indices`, with no `unsafe`. Output is byte-identical: the
level *values* are unchanged, just produced via memset+scatter (fast
path) or via the high-level `NullBuffer` iterators (slow path) instead
of a manual `BitIndexIterator` walk.
## Threshold (`BULK_FILL_MIN_LEN`)
The bulk-fill fast path is gated on two conditions:
- `len >= BULK_FILL_MIN_LEN` (currently 64). Per-call
slice/popcount/iterator overhead only amortizes on sizable sub-ranges.
List/struct paths call `write_leaf` many times with tiny ranges (avg
list length 1-5); paying any per-call popcount there would regress them.
A threshold sweep at T = {0, 16, 32, 64, 128, 256} on Ryzen 9 9950X
shows the regression floor settles by T=32, and the choice of 64 gives
~12x margin over the average list length without losing the
flat-primitive wins.
- `nulls.null_count() * 2 >= nulls.len()`. The cached `null_count()` is
O(1), so this check is free. We use the buffer-wide density as a
heuristic for the sub-range; for full-array writes (the primary target,
flat primitive columns) it's exact.
Even when the gate skips the fast path, evaluating it across
high-frequency call sites (~10K calls in some list benchmarks) is a
small structural cost (~1-2% on list-sparse cases). The wins on the
targeted shapes (-35% sparse-primitive, -66% all-null primitive) far
outweigh that. Reducing the cost further would require hoisting the
decision into the caller.
## Are these changes tested?
Existing tests cover this path: `cargo test -p parquet --features arrow
--lib arrow_writer` is green (136 tests, full of nulls and roundtrips);
full `cargo test -p parquet --features arrow` green modulo the
pre-existing `PARQUET_TEST_DATA` submodule failures (unrelated, same on
`main`). `cargo clippy -p parquet --features arrow --lib` and `cargo fmt
--check` clean. The `unsafe get_unchecked_mut` flagged in the original
revision was replaced via `NullBuffer::valid_indices()`; the slow-path
also dropped its `unsafe value_unchecked` for the same reason.
## Are there any user-facing changes?
None.
## Benchmarks
`cargo bench -p parquet --bench arrow_writer`, 1M rows × 7 nullable
primitive columns, local Ryzen 9 9950X:
```
primitive_sparse_99pct_null/default 11.88 ms -> 9.13 ms (-23%) <- the
case #9731 calls out
primitive_all_null/default 5.65 ms -> 2.33 ms (-59%)
(subsumed by #9954's O(1) path if that lands first)
struct_sparse_99pct_null/default 5.67 ms -> 5.32 ms (-6%)
struct_all_null/default 1.52 ms -> 1.31 ms (-14%)
list_primitive_sparse_99pct_null, primitive (25% null), primitive_non_null,
bool, string: within noise (no regression)
```
The CI benchmark bot (GKE `c4a-highmem-16`, Neoverse-V2) on the
post-fixup revision shows the same shape with stronger relative wins on
the targeted cases:
```
primitive_all_null/default 2.47x (11.0ms -> 4.4ms)
primitive_sparse_99pct_null/default 1.60x (16.8ms -> 10.5ms)
primitive_all_null/{bloom_filter,cdc,parquet_2,zstd,zstd_parquet_2}
1.38x to 2.48x
primitive_sparse_99pct_null/{...} 1.28x to 1.59x
list_primitive*, list_primitive_sparse_99pct_null*:
1.00x to 1.01x (within noise)
```
Microbench of the definition-level fill in isolation: 10.3x @ 100%-null,
8.6x @ 99%, 5.2x @ 90%, 1.9x @ 50%, 0.93x @ 10%, 0.81x @ 0%. Crossover ≈
12-15% null, clean win above ~25%; the `>= 50% null` guard is
conservative.
This is the *materialization*-cost half of #9731 (~30% of the 99%-null
write); the *walk*-cost half, a run-length input to the level encoder so
the column writer doesn't even iterate all `num_rows` levels, is the
larger structural change #9653 is heading toward. This PR is
deliberately small and isolated so it lands independently of and rebases
cleanly under that work.
---------
Co-authored-by: Ryan Stewart <[email protected]>
---
parquet/src/arrow/arrow_writer/levels.rs | 52 +++++++++++++++++++++++++-------
1 file changed, 41 insertions(+), 11 deletions(-)
diff --git a/parquet/src/arrow/arrow_writer/levels.rs
b/parquet/src/arrow/arrow_writer/levels.rs
index 73f9221376..10f90f707c 100644
--- a/parquet/src/arrow/arrow_writer/levels.rs
+++ b/parquet/src/arrow/arrow_writer/levels.rs
@@ -153,6 +153,13 @@ enum LevelInfoBuilder {
Struct(Vec<LevelInfoBuilder>, LevelContext, Option<NullBuffer>),
}
+/// Minimum sub-range length before the bulk-fill fast path in `write_leaf`
+/// becomes profitable for null-heavy leaf columns. Below this, per-call
+/// slice + popcount overhead regresses list/struct paths that call
+/// `write_leaf` many times with tiny ranges. Picked via threshold sweep;
+/// see <https://github.com/apache/arrow-rs/pull/9967> for the rationale.
+const BULK_FILL_MIN_LEN: usize = 64;
+
impl LevelInfoBuilder {
/// Create a new [`LevelInfoBuilder`] for the given [`Field`] and parent
[`LevelContext`]
fn try_new(field: &Field, parent_ctx: LevelContext, array: &ArrayRef) ->
Result<Self> {
@@ -676,20 +683,43 @@ impl LevelInfoBuilder {
match &info.logical_nulls {
Some(nulls) => {
assert!(range.end <= nulls.len());
- let nulls = nulls.inner();
- info.def_levels.extend_from_iter(range.clone().map(|i| {
- // Safety: range.end was asserted to be in bounds
earlier
- let valid = unsafe { nulls.value_unchecked(i) };
- max_def_level - (!valid as i16)
- }));
- info.non_null_indices.reserve(len);
- info.non_null_indices.extend(
- BitIndexIterator::new(nulls.inner(), nulls.offset() +
range.start, len)
- .map(|i| i + range.start),
- );
+ // Bulk-fill is profitable only on null-heavy ranges long
enough to
+ // amortize the slice/popcount cost; see
`BULK_FILL_MIN_LEN` and the
+ // PR description for the threshold sweep. The gate uses
the cached
+ // buffer-wide `null_count` (O(1)) to stay cheap on the
cold path.
+ if len >= BULK_FILL_MIN_LEN && nulls.null_count() * 2 >=
nulls.len() {
+ let range_nulls = nulls.slice(range.start, len);
+ let valid_in_range = len - range_nulls.null_count();
+ let null_def_level = max_def_level - 1;
+ let buf = info
+ .def_levels
+ .materialize_mut()
+ .expect("definition levels present");
+ let base = buf.len();
+ buf.resize(base + len, null_def_level);
+ for i in range_nulls.valid_indices() {
+ buf[base + i] = max_def_level;
+ }
+ info.non_null_indices.reserve(valid_in_range);
+ info.non_null_indices
+ .extend(range_nulls.valid_indices().map(|i| i +
range.start));
+ } else {
+ let bits = nulls.inner();
+ info.def_levels.extend_from_iter(range.clone().map(|i|
{
+ // Safety: range.end was asserted to be in bounds
earlier
+ let valid = unsafe { bits.value_unchecked(i) };
+ max_def_level - (!valid as i16)
+ }));
+ info.non_null_indices.reserve(len);
+ info.non_null_indices.extend(
+ BitIndexIterator::new(bits.inner(), bits.offset()
+ range.start, len)
+ .map(|i| i + range.start),
+ );
+ }
}
None => {
info.append_def_level_run(max_def_level, len);
+ info.non_null_indices.reserve(len);
info.non_null_indices.extend(range.clone());
}
}