[PR] perf: optimize arrays_zip perfect list zips [datafusion]

via GitHub Sat, 16 May 2026 16:03:27 -0700


puneetdixit200 opened a new pull request, #22285:
URL: https://github.com/apache/datafusion/pull/22285


   ## Which issue does this PR close?
   
   - Closes #22225.
   
   ## Rationale for this change
   
   `arrays_zip` currently uses the general `MutableArrayData` path even when 
all regular `ListArray` inputs are already perfectly aligned. In that case, the 
output list offsets match the inputs and each struct child column can reuse the 
corresponding input values array instead of copying one row at a time.
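
The alignment check described above can be sketched with plain standard-library types. This is a simplified model, not the code in this PR: `ListColumn`, `ZippedLists`, and `try_perfect_zip` are hypothetical names, `Arc<Vec<_>>` stands in for Arrow's reference-counted buffers, and null handling is omitted entirely.

```rust
use std::sync::Arc;

/// Hypothetical stand-in for a list column: row `i` is
/// `values[offsets[i]..offsets[i + 1]]`.
#[derive(Debug, Clone)]
struct ListColumn {
    offsets: Arc<Vec<i32>>,
    values: Arc<Vec<i64>>,
}

/// Hypothetical stand-in for the zipped output: one offsets buffer
/// shared by all rows, plus one child column per input.
#[derive(Debug)]
struct ZippedLists {
    offsets: Arc<Vec<i32>>,
    children: Vec<Arc<Vec<i64>>>,
}

/// True when the offsets start at 0 and end exactly at `values.len()`,
/// so the values buffer carries no leading or trailing hidden elements.
fn is_tight(col: &ListColumn) -> bool {
    col.offsets.first() == Some(&0)
        && col.offsets.last().map(|&o| o as usize) == Some(col.values.len())
}

/// Returns `Some` only when every input has identical, tight offsets.
/// In that case the output shares the first input's offsets and each
/// child shares its input's values (reference-count bumps, no copies).
/// Returns `None` so the caller can fall back to a general path.
fn try_perfect_zip(inputs: &[ListColumn]) -> Option<ZippedLists> {
    let first = inputs.first()?;
    let aligned = inputs
        .iter()
        .all(|col| is_tight(col) && col.offsets == first.offsets);
    if !aligned {
        return None;
    }
    Some(ZippedLists {
        offsets: Arc::clone(&first.offsets),
        children: inputs.iter().map(|col| Arc::clone(&col.values)).collect(),
    })
}
```

In this model the fast path does work proportional to the number of offsets (for the equality check) rather than copying values row by row, which is the effect the rationale describes.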
   
   ## What changes are included in this PR?
   
   - Add a fast path for perfect regular `ListArray` zips that reuses the first 
input's offsets and clones the input values arrays into the output struct 
children.
   - Keep the existing general path for ragged inputs, `LargeList`, 
`FixedSizeList`, `Null` inputs, and null rows that would require padding.
   - Add unit coverage for offset/value reuse, zero-length null rows, and null 
rows with hidden values falling back to the general path.
   - Rename the no-null benchmark case to `arrays_zip_perfect_zip_8192`.
   
   ## Are these changes tested?
   
   Yes. Local checks run:
   
   - `cargo fmt --all`
   - `cargo test -p datafusion-functions-nested`
   - `cargo clippy -p datafusion-functions-nested --all-targets --all-features 
-- -D warnings`
   - `CARGO_TARGET_DIR=C:\df-target cargo clippy --all-targets --all-features 
-- -D warnings`
   - `CARGO_TARGET_DIR=C:\df-target cargo bench -p datafusion-functions-nested 
--bench arrays_zip -- --warm-up-time 1 --measurement-time 2 --sample-size 10`
   
   Latest local benchmark sample:
   
   - `arrays_zip_perfect_zip_8192`: `11.234 µs 11.600 µs 12.045 µs`
   - `arrays_zip_10pct_nulls_8192`: `4.3463 ms 4.5531 ms 4.7898 ms`
   
   ## Are there any user-facing changes?
   
   No. This is an internal performance optimization with the same `arrays_zip` 
output semantics.


