[I] perf: Use bulk-NULL builders in split and substring [datafusion-comet]

via GitHub Thu, 21 May 2026 18:49:18 -0700


mbutrovich opened a new issue, #4397:
URL: https://github.com/apache/datafusion-comet/issues/4397


   ### What is the problem the feature request solves?
   
   Several DataFusion PRs from @neilconway (#21789, #21849, #21863, #21845, 
#21847, #21854, #21877, #21991, #22029, #21366, #21519) replaced per-row null 
buffer maintenance in string functions with bulk null handling. The pattern: 
when an output array's null mask is identical to (or a simple function of) the 
input's, don't `append_null`/`append(true/false)` every iteration. Either clone 
the input `NullBuffer` up front or compute it once with 
`NullBuffer::union[_many]`. Reported wins were 3-15% on microbenchmarks.
   
   Two of Comet's native string functions still follow the old per-row pattern.
   
   **1. `native/spark-expr/src/string_funcs/split.rs`**
   
   Both `split_array` (lines 136-161) and `split_large_string_array` (lines 
173-198) maintain a per-row `BooleanBufferBuilder` even though `split` produces 
NULL output exactly when the input is NULL. The output null buffer is just 
`string_array.nulls().cloned()`.
   
   ```rust
   let mut null_buffer_builder = 
arrow::array::BooleanBufferBuilder::new(string_array.len());
   for i in 0..string_array.len() {
       if string_array.is_null(i) {
           offsets.push(offsets[i]);
           null_buffer_builder.append(false);
       } else {
           ...
           null_buffer_builder.append(true);
       }
   }
   ```
   
   **2. `native/spark-expr/src/string_funcs/substring.rs:138-191`**
   
   `spark_substring_negative_start` uses `GenericStringBuilder` / 
`GenericBinaryBuilder` with per-row `append_null()` across four arms (Utf8, 
LargeUtf8, Binary, LargeBinary). Output nulls equal input nulls. Same 
anti-pattern as DataFusion's `substr` before #21519/#21366.
   
   ### Describe the potential solution
   
   Adopt the bulk-NULL pattern in both files:
   
   1. Clone the input `NullBuffer` once up front.
   2. Iterate values only; for null rows, push an empty placeholder for the 
offset/value buffer and skip null bookkeeping.
   3. Attach the cloned null buffer when constructing the final array.
   
   Two additional cleanups worth bundling into `split.rs` while in the file:
   
   - `split_large_string_array` returns `GenericStringArray::<i32>` despite a 
`LargeUtf8` input, which can overflow offsets for large values. The result type 
should mirror the input (`i64` offsets when input is LargeUtf8).
   - The hot loop materializes a `Vec<String>` from `regex::Regex::split` 
(which already yields `&str`) and then copies into the array. Appending `&str` 
directly via the builder skips the intermediate `String` allocations. Mirrors 
the `append_with` improvement from DataFusion #22029.
   
   ### Additional context
   
   Reference PRs in DataFusion that established the pattern:
   
   - apache/datafusion#21789 — adds bulk NULL-aware string builders, uses them 
in `lower` / `upper`
   - apache/datafusion#21519, #21366 — bulk-NULL applied to `substr`
   - apache/datafusion#21849, #21854, #21863, #21845, #21847, #21877, #21991, 
#22029 — same pattern across `replace`, `repeat`, `initcap`, `uuid`, `chr`, 
`substr_index`, `reverse`
   
   Other Comet sites considered but rejected:
   
   - `array_funcs/arrays_overlap.rs` — row-by-row builder, but the inner loop 
has early-exit and tri-valued null logic, so bulk-NULL doesn't apply directly.
   - `array_funcs/array_position.rs` — already uses a `combined_nulls` helper 
plus flat-buffer scan; this is the post-optimization shape.
   - `agg_funcs/{covariance,correlation,stddev,variance,avg,sum_*}.rs` — 
already have `GroupsAccumulator` impls covering the territory of DataFusion 
#20504 / #21154.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[I] perf: Use bulk-NULL builders in split and substring [datafusion-comet]

Reply via email to