(arrow-rs) branch main updated: Bulk-fill definition levels for majority-null leaf columns (#9967)

alamb Wed, 20 May 2026 11:50:01 -0700

This is an automated email from the ASF dual-hosted git repository.

alamb pushed a commit to branch main
in repository https://gitbox.apache.org/repos/asf/arrow-rs.git



The following commit(s) were added to refs/heads/main by this push:
     new 73c513afd3 Bulk-fill definition levels for majority-null leaf columns 
(#9967)
73c513afd3 is described below

commit 73c513afd36b0c6a3b41ef9dc135aee568184255
Author: RyanStewart <[email protected]>
AuthorDate: Wed May 20 11:49:47 2026 -0700

    Bulk-fill definition levels for majority-null leaf columns (#9967)
    
    ## Which issue does this PR close?
    
    - Contributes to #9731.
    
    ## AI assistance
    
    Implementation drafted with AI assistance and iterated against the
    benchmarks below. I've reviewed and own the code, including the gate
    threshold which I picked from the sweep in [Threshold
    (`BULK_FILL_MIN_LEN`)](#threshold-bulk_fill_min_len). Per the project's
    [CONTRIBUTING guidance on AI-generated
    
submissions](https://github.com/apache/arrow-rs/blob/main/CONTRIBUTING.md#ai-generated-submissions).
    
    ## Rationale for this change
    
    When writing a nullable leaf (primitive) Arrow array, `write_leaf`
    builds the definition-level buffer one element at a time, mapping each
    null bit to a level. For columns that are mostly null this does
    ~`num_rows` of branchy work and allocates a `num_rows`-element level
    buffer even though almost every produced level is the same value. #9954
    adds an O(1) fast path for the *entirely* null case; this PR covers the
    *sparse* (mostly-but-not-entirely null) case it doesn't handle, the
    literal subject of #9731 ("a column that is 99% null … ~100x more work
    than necessary").
    
    ## What changes are included in this PR?
    
    A single popcount pass over the null mask
    (`Buffer::count_set_bits_offset`, O(`num_rows`/64)) counts the valid
    values in the range. When the slice is majority-null, the
    definition-level buffer is bulk-filled with the null level (a vectorized
    `Vec::resize` memset) and only the non-null positions (from
    `NullBuffer::valid_indices()`) are overwritten. The existing per-row
    path is kept for non-majority-null slices, so balanced and null-light
    columns are unaffected. Both branches share the same `let range_nulls =
    nulls.slice(range.start, len)` slicing idiom; the slow path uses
    `range_nulls.iter()` for the def-level map and
    `range_nulls.valid_indices().map(|i| i + range.start)` for
    `non_null_indices`, with no `unsafe`. Output is byte-identical: the
    level *values* are unchanged, just produced via memset+scatter (fast
    path) or via the high-level `NullBuffer` iterators (slow path) instead
    of a manual `BitIndexIterator` walk.
    
    ## Threshold (`BULK_FILL_MIN_LEN`)
    
    The bulk-fill fast path is gated on two conditions:
    
    - `len >= BULK_FILL_MIN_LEN` (currently 64). Per-call
    slice/popcount/iterator overhead only amortizes on sizable sub-ranges.
    List/struct paths call `write_leaf` many times with tiny ranges (avg
    list length 1-5); paying any per-call popcount there would regress them.
    A threshold sweep at T = {0, 16, 32, 64, 128, 256} on Ryzen 9 9950X
    shows the regression floor settles by T=32, and the choice of 64 gives
    ~12x margin over the average list length without losing the
    flat-primitive wins.
    - `nulls.null_count() * 2 >= nulls.len()`. The cached `null_count()` is
    O(1), so this check is free. We use the buffer-wide density as a
    heuristic for the sub-range; for full-array writes (the primary target,
    flat primitive columns) it's exact.
    
    Even when the gate skips the fast path, evaluating it across
    high-frequency call sites (~10K calls in some list benchmarks) is a
    small structural cost (~1-2% on list-sparse cases). The wins on the
    targeted shapes (-35% sparse-primitive, -66% all-null primitive) far
    outweigh that. Reducing the cost further would require hoisting the
    decision into the caller.
    
    ## Are these changes tested?
    
    Existing tests cover this path: `cargo test -p parquet --features arrow
    --lib arrow_writer` is green (136 tests, full of nulls and roundtrips);
    full `cargo test -p parquet --features arrow` green modulo the
    pre-existing `PARQUET_TEST_DATA` submodule failures (unrelated, same on
    `main`). `cargo clippy -p parquet --features arrow --lib` and `cargo fmt
    --check` clean. The `unsafe get_unchecked_mut` flagged in the original
    revision was replaced via `NullBuffer::valid_indices()`; the slow-path
    also dropped its `unsafe value_unchecked` for the same reason.
    
    ## Are there any user-facing changes?
    
    None.
    
    ## Benchmarks
    
    `cargo bench -p parquet --bench arrow_writer`, 1M rows × 7 nullable
    primitive columns, local Ryzen 9 9950X:
    
    ```
    primitive_sparse_99pct_null/default   11.88 ms -> 9.13 ms   (-23%)   <- the 
case #9731 calls out
    primitive_all_null/default             5.65 ms -> 2.33 ms   (-59%)   
(subsumed by #9954's O(1) path if that lands first)
    struct_sparse_99pct_null/default       5.67 ms -> 5.32 ms   (-6%)
    struct_all_null/default                1.52 ms -> 1.31 ms   (-14%)
    list_primitive_sparse_99pct_null, primitive (25% null), primitive_non_null, 
bool, string:  within noise (no regression)
    ```
    
    The CI benchmark bot (GKE `c4a-highmem-16`, Neoverse-V2) on the
    post-fixup revision shows the same shape with stronger relative wins on
    the targeted cases:
    
    ```
    primitive_all_null/default              2.47x (11.0ms -> 4.4ms)
    primitive_sparse_99pct_null/default     1.60x (16.8ms -> 10.5ms)
    primitive_all_null/{bloom_filter,cdc,parquet_2,zstd,zstd_parquet_2}    
1.38x to 2.48x
    primitive_sparse_99pct_null/{...}        1.28x to 1.59x
    list_primitive*, list_primitive_sparse_99pct_null*:                    
1.00x to 1.01x (within noise)
    ```
    
    Microbench of the definition-level fill in isolation: 10.3x @ 100%-null,
    8.6x @ 99%, 5.2x @ 90%, 1.9x @ 50%, 0.93x @ 10%, 0.81x @ 0%. Crossover ≈
    12-15% null, clean win above ~25%; the `>= 50% null` guard is
    conservative.
    
    This is the *materialization*-cost half of #9731 (~30% of the 99%-null
    write); the *walk*-cost half, a run-length input to the level encoder so
    the column writer doesn't even iterate all `num_rows` levels, is the
    larger structural change #9653 is heading toward. This PR is
    deliberately small and isolated so it lands independently of and rebases
    cleanly under that work.
    
    ---------
    
    Co-authored-by: Ryan Stewart <[email protected]>
---
 parquet/src/arrow/arrow_writer/levels.rs | 52 +++++++++++++++++++++++++-------
 1 file changed, 41 insertions(+), 11 deletions(-)

diff --git a/parquet/src/arrow/arrow_writer/levels.rs 
b/parquet/src/arrow/arrow_writer/levels.rs
index 73f9221376..10f90f707c 100644
--- a/parquet/src/arrow/arrow_writer/levels.rs
+++ b/parquet/src/arrow/arrow_writer/levels.rs
@@ -153,6 +153,13 @@ enum LevelInfoBuilder {
     Struct(Vec<LevelInfoBuilder>, LevelContext, Option<NullBuffer>),
 }
 
+/// Minimum sub-range length before the bulk-fill fast path in `write_leaf`
+/// becomes profitable for null-heavy leaf columns. Below this, per-call
+/// slice + popcount overhead regresses list/struct paths that call
+/// `write_leaf` many times with tiny ranges. Picked via threshold sweep;
+/// see <https://github.com/apache/arrow-rs/pull/9967> for the rationale.
+const BULK_FILL_MIN_LEN: usize = 64;
+
 impl LevelInfoBuilder {
     /// Create a new [`LevelInfoBuilder`] for the given [`Field`] and parent 
[`LevelContext`]
     fn try_new(field: &Field, parent_ctx: LevelContext, array: &ArrayRef) -> 
Result<Self> {
@@ -676,20 +683,43 @@ impl LevelInfoBuilder {
             match &info.logical_nulls {
                 Some(nulls) => {
                     assert!(range.end <= nulls.len());
-                    let nulls = nulls.inner();
-                    info.def_levels.extend_from_iter(range.clone().map(|i| {
-                        // Safety: range.end was asserted to be in bounds 
earlier
-                        let valid = unsafe { nulls.value_unchecked(i) };
-                        max_def_level - (!valid as i16)
-                    }));
-                    info.non_null_indices.reserve(len);
-                    info.non_null_indices.extend(
-                        BitIndexIterator::new(nulls.inner(), nulls.offset() + 
range.start, len)
-                            .map(|i| i + range.start),
-                    );
+                    // Bulk-fill is profitable only on null-heavy ranges long 
enough to
+                    // amortize the slice/popcount cost; see 
`BULK_FILL_MIN_LEN` and the
+                    // PR description for the threshold sweep. The gate uses 
the cached
+                    // buffer-wide `null_count` (O(1)) to stay cheap on the 
cold path.
+                    if len >= BULK_FILL_MIN_LEN && nulls.null_count() * 2 >= 
nulls.len() {
+                        let range_nulls = nulls.slice(range.start, len);
+                        let valid_in_range = len - range_nulls.null_count();
+                        let null_def_level = max_def_level - 1;
+                        let buf = info
+                            .def_levels
+                            .materialize_mut()
+                            .expect("definition levels present");
+                        let base = buf.len();
+                        buf.resize(base + len, null_def_level);
+                        for i in range_nulls.valid_indices() {
+                            buf[base + i] = max_def_level;
+                        }
+                        info.non_null_indices.reserve(valid_in_range);
+                        info.non_null_indices
+                            .extend(range_nulls.valid_indices().map(|i| i + 
range.start));
+                    } else {
+                        let bits = nulls.inner();
+                        info.def_levels.extend_from_iter(range.clone().map(|i| 
{
+                            // Safety: range.end was asserted to be in bounds 
earlier
+                            let valid = unsafe { bits.value_unchecked(i) };
+                            max_def_level - (!valid as i16)
+                        }));
+                        info.non_null_indices.reserve(len);
+                        info.non_null_indices.extend(
+                            BitIndexIterator::new(bits.inner(), bits.offset() 
+ range.start, len)
+                                .map(|i| i + range.start),
+                        );
+                    }
                 }
                 None => {
                     info.append_def_level_run(max_def_level, len);
+                    info.non_null_indices.reserve(len);
                     info.non_null_indices.extend(range.clone());
                 }
             }

(arrow-rs) branch main updated: Bulk-fill definition levels for majority-null leaf columns (#9967)

Reply via email to