Dandandan opened a new pull request, #22654: URL: https://github.com/apache/datafusion/pull/22654
## Which issue does this PR close?

- Addresses #12131 (TODO in `equal_rows_arr`: "optimize equal_rows_arr to avoid allocation of intermediate arrays").

## Rationale for this change

In the hash join probe path, `equal_rows_arr` verifies candidate `(build, probe)` row pairs produced by hash-map chain traversal. The previous implementation called `arrow::take` on **every key column for both sides** of each probe chunk, materializing temporary arrays that were immediately discarded after an `eq` + `and` + `FilterBuilder` pass: `2 * n_columns` array allocations plus per-column boolean arrays, on every probe batch regardless of join selectivity. This is one of the hottest per-batch allocation sites in the hash join.

## What changes are included in this PR?

Add a row-wise fast path to `equal_rows_arr`:

- Build **one equality closure per key column** (downcast once per batch pair).
- Compare candidate rows **in place by index** in a single pass, fusing the random-access load with the comparison, and push surviving indices straight into the output buffers.
- No intermediate `take` materialization, no per-column boolean arrays, no separate filter pass.

Key columns whose types are not handled by the fast path (nested types, dictionaries, etc.) fall back to the original array-based implementation (`equal_rows_arr_take`), so behavior is unchanged for those.

**Semantics are preserved exactly** to match the `eq` / `not_distinct` kernels:

- non-null vs non-null: native `==`; notably float `NaN != NaN`, matching the `eq` kernel (and unlike `make_comparator`'s total ordering);
- one side null: never equal;
- both null: equal only under `NullEquality::NullEqualsNull`.

## Are these changes tested?

Covered by existing tests:

- `cargo test -p datafusion-physical-plan --lib joins::`: 967 join unit tests pass (inner/left/right/full/semi/anti, multi-column keys, null-equality variants).
- `cargo test -p datafusion-sqllogictest --test sqllogictests -- joins`: passes.
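
For readers who want a concrete picture of the row-wise fast path and the null semantics listed under "What changes are included in this PR?", here is a minimal, std-only sketch. It is not the code in this PR: key columns are modeled as slices of `Option<T>` instead of Arrow arrays, and `make_row_eq`, `filter_equal_rows`, the `RowEq` alias, and the local `NullEquality` enum are hypothetical stand-ins. The `u64` build-side and `u32` probe-side index types are likewise an assumption made for illustration.

```rust
/// Hypothetical stand-in for the null-equality setting named in the PR.
#[derive(Clone, Copy, Debug, PartialEq)]
enum NullEquality {
    NullEqualsNothing,
    NullEqualsNull,
}

/// One boxed equality closure per key column: (build_row, probe_row) -> equal?
type RowEq<'a> = Box<dyn Fn(usize, usize) -> bool + 'a>;

/// Build the closure once per batch pair. A nullable column is modeled as
/// `&[Option<T>]`; a real implementation would downcast Arrow arrays here.
fn make_row_eq<'a, T: PartialEq + 'a>(
    build: &'a [Option<T>],
    probe: &'a [Option<T>],
    null_eq: NullEquality,
) -> RowEq<'a> {
    Box::new(move |b, p| match (&build[b], &probe[p]) {
        // Non-null vs non-null: native `==`, so f64 NaN != NaN.
        (Some(l), Some(r)) => l == r,
        // Both null: equal only under NullEqualsNull.
        (None, None) => null_eq == NullEquality::NullEqualsNull,
        // Exactly one side null: never equal.
        _ => false,
    })
}

/// Single pass over candidate pairs: compare in place by index and push
/// surviving indices straight into the output buffers. No gathered copies
/// of the key columns and no per-column boolean masks are created.
fn filter_equal_rows(
    build_idx: &[u64],
    probe_idx: &[u32],
    columns: &[RowEq<'_>],
) -> (Vec<u64>, Vec<u32>) {
    let mut out_build = Vec::with_capacity(build_idx.len());
    let mut out_probe = Vec::with_capacity(probe_idx.len());
    for (&b, &p) in build_idx.iter().zip(probe_idx) {
        // `all` short-circuits on the first key column that differs.
        if columns.iter().all(|eq| eq(b as usize, p as usize)) {
            out_build.push(b);
            out_probe.push(p);
        }
    }
    (out_build, out_probe)
}
```

In this sketch a caller would build one `RowEq` per key column, collect them into a `Vec`, and pass the candidate index pairs from the hash-map traversal to `filter_equal_rows`. Boxing the closures keeps the per-type dispatch out of the inner loop, which is the idea behind "downcast once per batch pair"; the actual PR may structure this differently.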

The fast path is exercised by the common primitive/string/binary/temporal key types; the fallback by nested/dictionary keys. No new behavior is introduced, so existing coverage applies.

## Are there any user-facing changes?

No. This is an internal performance optimization with identical results.
