mbutrovich opened a new issue, #22454:
URL: https://github.com/apache/datafusion/issues/22454

   ### Is your feature request related to a problem or challenge?
   
   Two patterns from @neilconway's recent perf PRs are not yet applied 
throughout the `datafusion-spark` crate:
   
   1. **Bulk-NULL builders** 
([#21789](https://github.com/apache/datafusion/pull/21789), 
[#21849](https://github.com/apache/datafusion/pull/21849), 
[#21863](https://github.com/apache/datafusion/pull/21863), 
[#21847](https://github.com/apache/datafusion/pull/21847), 
[#21854](https://github.com/apache/datafusion/pull/21854), 
[#21877](https://github.com/apache/datafusion/pull/21877), 
[#21991](https://github.com/apache/datafusion/pull/21991), 
[#22029](https://github.com/apache/datafusion/pull/22029), 
[#21519](https://github.com/apache/datafusion/pull/21519), 
[#21366](https://github.com/apache/datafusion/pull/21366)) — when an output 
null mask equals (or is a `NullBuffer::union[_many]` of) the input null masks, 
precompute it once instead of maintaining it row-by-row with `append_null` / 
`BooleanBufferBuilder.append`.
   2. **`NullBuffer::union_many`** 
([#22070](https://github.com/apache/datafusion/pull/22070)) — replaces chains 
of 3+ binary `NullBuffer::union` calls with one clone + in-place `&=`.
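
The two patterns above can be sketched without any Arrow dependency. The block below is a minimal std-only model, not the arrow-rs implementation: `Validity`, `Col`, and `add3` are hypothetical names invented for illustration, and `union_many` here only mimics the "one clone, then in-place `&=`" shape described in the issue. A real kernel would use arrow's `NullBuffer` instead.

```rust
/// Std-only stand-in for a validity bitmap: bit `i` set means row `i` is non-null.
/// Hypothetical type for illustration; real kernels would use arrow's `NullBuffer`.
#[derive(Clone, Debug, PartialEq)]
pub struct Validity {
    words: Vec<u64>,
    len: usize,
}

impl Validity {
    pub fn from_bools(bits: &[bool]) -> Self {
        let mut words = vec![0u64; (bits.len() + 63) / 64];
        for (i, &valid) in bits.iter().enumerate() {
            if valid {
                words[i / 64] |= 1u64 << (i % 64);
            }
        }
        Self { words, len: bits.len() }
    }

    pub fn is_valid(&self, i: usize) -> bool {
        assert!(i < self.len, "row index out of bounds");
        (self.words[i / 64] >> (i % 64)) & 1 == 1
    }
}

/// Combine any number of optional masks: clone the first one present,
/// then AND the rest into it in place. `None` means "this input has no nulls".
pub fn union_many<'a>(
    masks: impl IntoIterator<Item = Option<&'a Validity>>,
) -> Option<Validity> {
    let mut present = masks.into_iter().flatten();
    let mut acc = present.next()?.clone();
    for m in present {
        assert_eq!(m.len, acc.len, "all masks must cover the same rows");
        for (a, b) in acc.words.iter_mut().zip(&m.words) {
            *a &= *b;
        }
    }
    Some(acc)
}

/// A nullable i64 column: raw values plus an optional validity mask.
pub struct Col<'a> {
    pub values: &'a [i64],
    pub nulls: Option<&'a Validity>,
}

/// Bulk-NULL kernel: the output mask is computed once, up front. The loop
/// computes a value for every row (masked rows just hold a placeholder)
/// and never touches the mask.
pub fn add3(a: &Col, b: &Col, c: &Col) -> (Vec<i64>, Option<Validity>) {
    assert!(a.values.len() == b.values.len() && b.values.len() == c.values.len());
    let nulls = union_many([a.nulls, b.nulls, c.nulls]);
    let values = a
        .values
        .iter()
        .zip(b.values)
        .zip(c.values)
        .map(|((x, y), z)| x.wrapping_add(*y).wrapping_add(*z))
        .collect();
    (values, nulls)
}
```

The design point is that the per-row loop contains no null branch at all. This is only valid when the output mask is exactly the union of the input masks, which is the precondition stated in item 1.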
   
   An audit of the Spark crate turned up seven sites where these patterns apply.
   
   ### Describe the solution you'd like
   
   Apply the patterns at the following call sites. Each checkbox is one PR.
   
   - [ ] **`datafusion/spark/src/function/math/width_bucket.rs:206-276, 
313-432`** — `width_bucket` kernel has four nullable inputs (value, min, max, 
n_buckets). The macro generates the same code for Float64, Duration, 
IntervalYearMonth, IntervalMonthDayNano. Precompute 
`NullBuffer::union_many([v.nulls(), min.nulls(), max.nulls(), n.nulls()])` 
outside the loop, append placeholders for masked rows, reattach at the end.
   - [ ] **`datafusion/spark/src/function/array/slice.rs:152-186`** — 
`calculate_start_end` runs two `Int64Builder`s in lockstep, both `append_null` 
when any of (values, start, length) is null. Precompute `union_many` of three 
input nulls, fill placeholders, reattach to both builders.
   - [ ] **`datafusion/spark/src/function/string/substring.rs:371-401`** — 
`SparkSubstring`'s main loop. Three-input null union (array + start_array + 
optional length_array). Negative-length branch still needs row-level 
`append_empty`, but the null mask itself can be precomputed.
   - [ ] 
**`datafusion/spark/src/function/datetime/make_dt_interval.rs:170-197`** — 1-4 
nullable inputs wrapped in `Option<&ArrayRef>`. `union_many` over the 
`Some(...)` arrays' nulls.
   - [ ] **`datafusion/spark/src/function/datetime/make_interval.rs:190-`** — 
same structure as `make_dt_interval`.
   - [ ] **`datafusion/spark/src/function/string/format_string.rs:126-157`** — 
accumulates `Vec<String>` then `StringArray::from(vec)`. Switch to 
`GenericStringBuilder` + `append_value(&str)`. Matches the `append_with` 
pattern from #22029.
   - [ ] **`datafusion/spark/src/function/conversion/cast.rs:199-209, 
226-236`** — `cast_int_to_timestamp` / `cast_float_to_timestamp` do per-row 
`append_null` / `append_value` where output nulls equal input nulls. Clone 
input nulls.
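
For the `format_string.rs` item, the difference between collecting a `Vec<String>` and appending into a builder can be illustrated with a std-only model. This is a sketch under assumptions: `StrBuilder` is a hypothetical stand-in for an Arrow-style string builder (one contiguous buffer plus offsets), and its `append_with` method is an invented signature that only gestures at the pattern the issue attributes to #22029. The `"id={x}"` format is a placeholder, not what `format_string` actually produces.

```rust
use std::fmt::Write;

/// Hypothetical std-only stand-in for an Arrow-style string builder:
/// one contiguous UTF-8 buffer plus an offsets vector, no per-row `String`.
pub struct StrBuilder {
    data: String,
    offsets: Vec<usize>, // row i occupies data[offsets[i]..offsets[i + 1]]
}

impl StrBuilder {
    pub fn with_capacity(rows: usize, bytes: usize) -> Self {
        let mut offsets = Vec::with_capacity(rows + 1);
        offsets.push(0);
        Self { data: String::with_capacity(bytes), offsets }
    }

    /// Copy an already-formatted value into the shared buffer.
    pub fn append_value(&mut self, s: &str) {
        self.data.push_str(s);
        self.offsets.push(self.data.len());
    }

    /// Let the caller format straight into the shared buffer, then seal the row.
    /// (Assumed shape; the closure is trusted to only append.)
    pub fn append_with(&mut self, f: impl FnOnce(&mut String)) {
        f(&mut self.data);
        self.offsets.push(self.data.len());
    }

    pub fn len(&self) -> usize {
        self.offsets.len() - 1
    }

    pub fn value(&self, i: usize) -> &str {
        &self.data[self.offsets[i]..self.offsets[i + 1]]
    }
}

/// Baseline shape described in the issue: one heap-allocated `String` per row.
pub fn format_rows_vec(xs: &[i64]) -> Vec<String> {
    xs.iter().map(|x| format!("id={x}")).collect()
}

/// Builder shape: every row is written into the same buffer.
pub fn format_rows_builder(xs: &[i64]) -> StrBuilder {
    let mut b = StrBuilder::with_capacity(xs.len(), xs.len() * 8);
    for x in xs {
        b.append_with(|buf| {
            // Writing into a `String` cannot fail, so the result is ignored.
            let _ = write!(buf, "id={x}");
        });
    }
    b
}
```

Both functions produce the same row contents; the builder version avoids allocating one `String` per row and the later copy into the array's buffer.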
   
   ### Describe alternatives you've considered
   
   Bundling all sites into a single PR. Splitting per file keeps each PR 
reviewable, especially the `width_bucket` macro change which touches four type 
variants.
   
   ### Additional context
   
   These sites already use the patterns and can be skipped:
   
   - `datafusion/spark/src/function/spark/null_utils.rs:63` — uses `union_many`.
   - `datafusion/spark/src/function/map/str_to_map.rs:212` — uses `union_many`.
   - `datafusion/spark/src/function/hash/utils.rs` — fast-path null-count 
checks.
   
   Considered and rejected:
   
   - `datafusion/spark/src/function/string/elt.rs:115-139` — output null mask 
is data-dependent (which column's nulls to include depends on the runtime `idx` 
value).
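
To make the `elt.rs` rejection concrete, here is a simplified row-at-a-time model. It is an assumption-laden sketch, not the Spark or DataFusion implementation: `elt_model` and `union_of_inputs_is_null` are hypothetical helpers, and treating a null or out-of-range index as NULL is a simplification chosen for the example.

```rust
/// Hypothetical model of an `elt`-style selection: row `r` yields
/// `cols[idx[r] - 1][r]` (1-based index). A null or out-of-range index
/// yields NULL in this simplified model.
pub fn elt_model<'a>(
    idx: &[Option<i64>],
    cols: &[Vec<Option<&'a str>>],
) -> Vec<Option<&'a str>> {
    idx.iter()
        .enumerate()
        .map(|(r, i)| -> Option<&'a str> {
            let i = (*i)?;
            let k = usize::try_from(i).ok()?.checked_sub(1)?;
            cols.get(k)?[r]
        })
        .collect()
}

/// What a precomputed union mask would claim: a row is NULL whenever
/// *any* input (the index or any column) is NULL at that row.
pub fn union_of_inputs_is_null(
    idx: &[Option<i64>],
    cols: &[Vec<Option<&str>>],
) -> Vec<bool> {
    (0..idx.len())
        .map(|r| idx[r].is_none() || cols.iter().any(|c| c[r].is_none()))
        .collect()
}
```

With `idx = [1, 2, NULL]`, `col1 = ["a", NULL, "c"]`, and `col2 = [NULL, "y", "z"]`, the model returns `["a", "y", NULL]`, while the union mask would mark all three rows NULL. Only the selected column's null bit matters per row, so the mask cannot be derived from the inputs' masks alone.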
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

