mbutrovich opened a new issue, #4397:
URL: https://github.com/apache/datafusion-comet/issues/4397
### What is the problem the feature request solves?
Several DataFusion PRs from @neilconway (#21789, #21849, #21863, #21845,
#21847, #21854, #21877, #21991, #22029, #21366, #21519) replaced per-row null
buffer maintenance in string functions with bulk null handling. The pattern:
when an output array's null mask is identical to (or a simple function of) the
input's, don't `append_null`/`append(true/false)` every iteration. Either clone
the input `NullBuffer` up front or compute it once with
`NullBuffer::union[_many]`. Reported wins were 3-15% on microbenchmarks.
Two of Comet's native string functions still follow the old per-row pattern.
**1. `native/spark-expr/src/string_funcs/split.rs`**
Both `split_array` (lines 136-161) and `split_large_string_array` (lines
173-198) maintain a per-row `BooleanBufferBuilder` even though `split` produces
NULL output exactly when the input is NULL. The output null buffer is just
`string_array.nulls().cloned()`.
```rust
let mut null_buffer_builder =
arrow::array::BooleanBufferBuilder::new(string_array.len());
for i in 0..string_array.len() {
if string_array.is_null(i) {
offsets.push(offsets[i]);
null_buffer_builder.append(false);
} else {
...
null_buffer_builder.append(true);
}
}
```
**2. `native/spark-expr/src/string_funcs/substring.rs:138-191`**
`spark_substring_negative_start` uses `GenericStringBuilder` /
`GenericBinaryBuilder` with per-row `append_null()` across four arms (Utf8,
LargeUtf8, Binary, LargeBinary). Output nulls equal input nulls. Same
anti-pattern as DataFusion's `substr` before #21519/#21366.
### Describe the potential solution
Adopt the bulk-NULL pattern in both files:
1. Clone the input `NullBuffer` once up front.
2. Iterate values only; for null rows, push an empty placeholder for the
offset/value buffer and skip null bookkeeping.
3. Attach the cloned null buffer when constructing the final array.
Two additional cleanups worth bundling into `split.rs` while in the file:
- `split_large_string_array` returns `GenericStringArray::<i32>` despite a
`LargeUtf8` input, which can overflow offsets for large values. The result type
should mirror the input (`i64` offsets when input is LargeUtf8).
- The hot loop materializes a `Vec<String>` from `regex::Regex::split`
(which already yields `&str`) and then copies into the array. Appending `&str`
directly via the builder skips the intermediate `String` allocations. Mirrors
the `append_with` improvement from DataFusion #22029.
### Additional context
Reference PRs in DataFusion that established the pattern:
- apache/datafusion#21789 — adds bulk NULL-aware string builders, uses them
in `lower` / `upper`
- apache/datafusion#21519, #21366 — bulk-NULL applied to `substr`
- apache/datafusion#21849, #21854, #21863, #21845, #21847, #21877, #21991,
#22029 — same pattern across `replace`, `repeat`, `initcap`, `uuid`, `chr`,
`substr_index`, `reverse`
Other Comet sites considered but rejected:
- `array_funcs/arrays_overlap.rs` — row-by-row builder, but the inner loop
has early-exit and tri-valued null logic, so bulk-NULL doesn't apply directly.
- `array_funcs/array_position.rs` — already uses a `combined_nulls` helper
plus flat-buffer scan; this is the post-optimization shape.
- `agg_funcs/{covariance,correlation,stddev,variance,avg,sum_*}.rs` —
already have `GroupsAccumulator` impls covering the territory of DataFusion
#20504 / #21154.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]