mbutrovich opened a new issue, #22454: URL: https://github.com/apache/datafusion/issues/22454
### Is your feature request related to a problem or challenge?

Two patterns from @neilconway's recent perf PRs are not yet applied throughout the `datafusion-spark` crate:

1. **Bulk-NULL builders** ([#21789](https://github.com/apache/datafusion/pull/21789), [#21849](https://github.com/apache/datafusion/pull/21849), [#21863](https://github.com/apache/datafusion/pull/21863), [#21847](https://github.com/apache/datafusion/pull/21847), [#21854](https://github.com/apache/datafusion/pull/21854), [#21877](https://github.com/apache/datafusion/pull/21877), [#21991](https://github.com/apache/datafusion/pull/21991), [#22029](https://github.com/apache/datafusion/pull/22029), [#21519](https://github.com/apache/datafusion/pull/21519), [#21366](https://github.com/apache/datafusion/pull/21366)): when an output null mask equals (or is a `NullBuffer::union[_many]` of) the input null masks, precompute it once instead of maintaining it row-by-row with `append_null` / `BooleanBufferBuilder.append`.
2. **`NullBuffer::union_many`** ([#22070](https://github.com/apache/datafusion/pull/22070)): replaces chains of 3+ binary `NullBuffer::union` calls with one clone + in-place `&=`.

An audit of the Spark crate turned up seven sites where these patterns apply.

### Describe the solution you'd like

Apply the patterns at the following call sites. Each checkbox is one PR.

- [ ] **`datafusion/spark/src/function/math/width_bucket.rs:206-276, 313-432`**: the `width_bucket` kernel has four nullable inputs (value, min, max, n_buckets). The macro generates the same code for Float64, Duration, IntervalYearMonth, IntervalMonthDayNano. Precompute `NullBuffer::union_many([v.nulls(), min.nulls(), max.nulls(), n.nulls()])` outside the loop, append placeholders for masked rows, reattach at the end.
- [ ] **`datafusion/spark/src/function/array/slice.rs:152-186`**: `calculate_start_end` runs two `Int64Builder`s in lockstep; both `append_null` when any of (values, start, length) is null. Precompute `union_many` of the three input nulls, fill placeholders, reattach to both builders.
- [ ] **`datafusion/spark/src/function/string/substring.rs:371-401`**: `SparkSubstring`'s main loop. Three-input null union (array + start_array + optional length_array). The negative-length branch still needs row-level `append_empty`, but the null mask itself can be precomputed.
- [ ] **`datafusion/spark/src/function/datetime/make_dt_interval.rs:170-197`**: 1-4 nullable inputs wrapped in `Option<&ArrayRef>`. `union_many` over the `Some(...)` arrays' nulls.
- [ ] **`datafusion/spark/src/function/datetime/make_interval.rs:190-`**: same structure as `make_dt_interval`.
- [ ] **`datafusion/spark/src/function/string/format_string.rs:126-157`**: accumulates a `Vec<String>`, then calls `StringArray::from(vec)`. Switch to `GenericStringBuilder` + `append_value(&str)`. Matches the `append_with` pattern from #22029.
- [ ] **`datafusion/spark/src/function/conversion/cast.rs:199-209, 226-236`**: `cast_int_to_timestamp` / `cast_float_to_timestamp` do per-row `append_null` / `append_value` where the output nulls equal the input nulls. Clone the input nulls instead.

### Describe alternatives you've considered

Bundling all sites into a single PR. Splitting per file keeps each PR reviewable, especially the `width_bucket` macro change, which touches four type variants.

### Additional context

These already use the patterns and can be skipped:

- `datafusion/spark/src/function/spark/null_utils.rs:63`: uses `union_many`.
- `datafusion/spark/src/function/map/str_to_map.rs:212`: uses `union_many`.
- `datafusion/spark/src/function/hash/utils.rs`: fast-path null-count checks.

Considered and rejected:

- `datafusion/spark/src/function/string/elt.rs:115-139`: the output null mask is data-dependent (which column's nulls to include depends on the runtime `idx` value).
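For reference, the shape of the two patterns can be sketched with std-only Rust. This is an illustrative model under stated assumptions, not the arrow-rs API: `Column`, `union_many`, and `sum3_bulk_nulls` are hypothetical names, and a `Vec<bool>` validity mask (`true` = valid) stands in for a packed `NullBuffer`. Optional masks are modeled as `None`, as in the `make_dt_interval` case.

```rust
/// Hypothetical stand-in for an Arrow array: values plus an optional
/// validity mask (`true` = valid, `None` = no nulls at all).
struct Column {
    values: Vec<i64>,
    validity: Option<Vec<bool>>,
}

/// Models the idea behind `NullBuffer::union_many`: clone the first mask
/// that is present, then AND every remaining mask into it in place.
/// Returns `None` when no input has a mask.
fn union_many(masks: &[Option<&[bool]>]) -> Option<Vec<bool>> {
    let mut present = masks.iter().flatten();
    let mut acc: Vec<bool> = present.next()?.to_vec(); // the one clone
    for m in present {
        assert_eq!(m.len(), acc.len(), "masks must have equal length");
        for (a, b) in acc.iter_mut().zip(m.iter()) {
            *a &= *b; // in-place `&=`
        }
    }
    Some(acc)
}

/// Hypothetical three-input kernel. The output mask is computed once,
/// outside the loop; masked rows get a placeholder value, and the mask
/// is attached to the result at the end.
fn sum3_bulk_nulls(a: &Column, b: &Column, c: &Column) -> Column {
    let validity = union_many(&[
        a.validity.as_deref(),
        b.validity.as_deref(),
        c.validity.as_deref(),
    ]);
    let values = (0..a.values.len())
        .map(|i| match &validity {
            Some(v) if !v[i] => 0, // placeholder for a null row
            _ => a.values[i] + b.values[i] + c.values[i],
        })
        .collect();
    Column { values, validity }
}
```

The loop body no longer touches the null mask at all, which is the point of the bulk-NULL pattern; in the real kernels the placeholder would be whatever the builder's default value is.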
