Dandandan opened a new pull request, #22654:
URL: https://github.com/apache/datafusion/pull/22654

   ## Which issue does this PR close?
   
   - Addresses #12131 (TODO in `equal_rows_arr`: "optimize equal_rows_arr to 
avoid allocation of intermediate arrays").
   
   ## Rationale for this change
   
   In the hash join probe path, `equal_rows_arr` verifies the candidate 
`(build, probe)` row pairs produced by hash-map chain traversal. The previous 
implementation called `arrow::take` on **every key column for both sides** of 
each probe chunk, materializing temporary arrays that were discarded 
immediately after an `eq` + `and` + `FilterBuilder` pass. That costs 
`2 * n_columns` array allocations plus per-column boolean arrays on every 
probe batch, regardless of join selectivity.
   
   This is one of the hottest per-batch allocation sites in the hash join.
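
   As a rough illustration of where those allocations come from, the sketch below models the take-based shape in plain Rust. It is not the DataFusion code: columns are modeled as `Vec<Option<i64>>`, and `take_model`, `eq_model`, `filter_model`, and `equal_rows_take_model` are hypothetical names that only imitate the structure of the `take`, `eq`, `and`, and filter steps.

```rust
// Hypothetical model: a key column is a slice of Option<i64> (None = null).
// `take_model` and `eq_model` only imitate the shape of the kernels named
// above; they are not the arrow-rs functions.

/// Gather `col[i]` for each index into a freshly allocated temporary column.
fn take_model(col: &[Option<i64>], indices: &[usize]) -> Vec<Option<i64>> {
    indices.iter().map(|&i| col[i]).collect()
}

/// Element-wise equality into a freshly allocated boolean mask.
/// For simplicity, a null on either side counts as "not equal" here.
fn eq_model(left: &[Option<i64>], right: &[Option<i64>]) -> Vec<bool> {
    left.iter()
        .zip(right)
        .map(|(l, r)| matches!((l, r), (Some(a), Some(b)) if a == b))
        .collect()
}

/// Keep only the indices whose mask entry is true (the separate filter pass).
fn filter_model(indices: &[usize], mask: &[bool]) -> Vec<usize> {
    indices
        .iter()
        .zip(mask)
        .filter(|(_, keep)| **keep)
        .map(|(&i, _)| i)
        .collect()
}

/// Verify candidate (build, probe) pairs the allocation-heavy way:
/// two temporary columns and one boolean mask per key column.
fn equal_rows_take_model(
    build_cols: &[Vec<Option<i64>>],
    probe_cols: &[Vec<Option<i64>>],
    build_idx: &[usize],
    probe_idx: &[usize],
) -> (Vec<usize>, Vec<usize>) {
    let mut mask = vec![true; build_idx.len()];
    for (b, p) in build_cols.iter().zip(probe_cols) {
        let b_taken = take_model(b, build_idx); // temporary column 1
        let p_taken = take_model(p, probe_idx); // temporary column 2
        let col_mask = eq_model(&b_taken, &p_taken); // temporary mask
        for (m, c) in mask.iter_mut().zip(col_mask) {
            *m &= c; // the `and` step across key columns
        }
    }
    (filter_model(build_idx, &mask), filter_model(probe_idx, &mask))
}
```

   In this model every key column costs two gathered vectors and one mask per call, even when almost no candidate pair survives.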
   
   ## What changes are included in this PR?
   
   Add a row-wise fast path to `equal_rows_arr`:
   
   - Build **one equality closure per key column** (downcast once per batch 
pair).
   - Compare candidate rows **in place by index** in a single pass, fusing the 
random-access load with the comparison, and push surviving indices straight 
into the output buffers.
   - No intermediate `take` materialization, no per-column boolean arrays, no 
separate filter pass.
   
   Key columns whose types are not handled by the fast path (nested types, 
dictionaries, etc.) fall back to the original array-based implementation 
(`equal_rows_arr_take`), so behavior is unchanged for those.
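
   The fast path described above can be sketched in plain Rust as follows. This is a minimal model under stated assumptions, not the code in this PR: `Column`, `RowEq`, `make_row_eq`, and `equal_rows_fused` are hypothetical names, only two key types are modeled, a null on either side is treated as "not equal", and returning `None` stands in for "fall back to the array-based implementation".

```rust
// Hypothetical stand-in for an Arrow array; only two key types are modeled.
enum Column {
    Int64(Vec<Option<i64>>),
    Utf8(Vec<Option<String>>),
}

/// A per-column comparator: (build row index, probe row index) -> equal?
type RowEq<'a> = Box<dyn Fn(usize, usize) -> bool + 'a>;

/// Equality of two non-null values; a null on either side is "not equal"
/// in this simplified model.
fn eq_non_null<T: PartialEq>(a: &Option<T>, b: &Option<T>) -> bool {
    matches!((a, b), (Some(x), Some(y)) if x == y)
}

/// Build one equality closure for a key column pair. The type dispatch
/// (the "downcast") happens once here, not once per row. Returns None for
/// combinations this sketch does not handle.
fn make_row_eq<'a>(build: &'a Column, probe: &'a Column) -> Option<RowEq<'a>> {
    match (build, probe) {
        (Column::Int64(b), Column::Int64(p)) => {
            Some(Box::new(move |i: usize, j: usize| eq_non_null(&b[i], &p[j])))
        }
        (Column::Utf8(b), Column::Utf8(p)) => {
            Some(Box::new(move |i: usize, j: usize| eq_non_null(&b[i], &p[j])))
        }
        _ => None, // unsupported pair: the caller would use the fallback
    }
}

/// Single fused pass: compare each candidate pair in place by index and
/// push survivors straight into the output buffers.
fn equal_rows_fused(
    build_cols: &[Column],
    probe_cols: &[Column],
    build_idx: &[usize],
    probe_idx: &[usize],
) -> Option<(Vec<usize>, Vec<usize>)> {
    let comparators = build_cols
        .iter()
        .zip(probe_cols)
        .map(|(b, p)| make_row_eq(b, p))
        .collect::<Option<Vec<_>>>()?;

    let mut out_build = Vec::new();
    let mut out_probe = Vec::new();
    for (&b, &p) in build_idx.iter().zip(probe_idx) {
        // `all` short-circuits on the first key column that differs.
        if comparators.iter().all(|row_eq| row_eq(b, p)) {
            out_build.push(b);
            out_probe.push(p);
        }
    }
    Some((out_build, out_probe))
}
```

   The only allocations in this model are the comparator list and the two output vectors; no gathered columns or per-column masks are created.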
   
   **Semantics are preserved exactly** to match the `eq` / `not_distinct` 
kernels:
   - non-null vs non-null: native `==` — notably float `NaN != NaN`, matching 
the `eq` kernel (and unlike `make_comparator`'s total ordering);
   - one side null: never equal;
   - both null: equal only under `NullEquality::NullEqualsNull`.
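
   The three rules above can be written as one small function. This is an illustrative sketch only: `NullEqualityModel` is a local stand-in for DataFusion's `NullEquality` (the `NullEqualsNothing` variant name is an assumption here), and `values_equal` is a hypothetical helper, not a function from this PR.

```rust
/// Local stand-in for the null-equality setting of a join.
#[derive(Clone, Copy)]
enum NullEqualityModel {
    NullEqualsNull,
    NullEqualsNothing, // assumed name for the "nulls never match" mode
}

/// Compare two optional key values under the rules listed above.
fn values_equal<T: PartialEq>(
    left: Option<T>,
    right: Option<T>,
    null_equality: NullEqualityModel,
) -> bool {
    match (left, right) {
        // non-null vs non-null: native `==`, so NaN is not equal to NaN
        (Some(a), Some(b)) => a == b,
        // both null: equal only when nulls are configured to match
        (None, None) => matches!(null_equality, NullEqualityModel::NullEqualsNull),
        // exactly one side null: never equal
        _ => false,
    }
}
```

   For example, `values_equal(Some(f64::NAN), Some(f64::NAN), NullEqualityModel::NullEqualsNull)` is `false` because `f64`'s `==` follows IEEE 754, whereas a total-order comparator would report the two NaNs as equal.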
   
   ## Are these changes tested?
   
   Covered by existing tests:
   - `cargo test -p datafusion-physical-plan --lib joins::` — 967 join unit 
tests pass (inner/left/right/full/semi/anti, multi-column keys, null-equality 
variants).
   - `cargo test -p datafusion-sqllogictest --test sqllogictests -- joins` — 
passes.
   
   The fast path is exercised by the common primitive/string/binary/temporal 
key types; the fallback by nested/dictionary keys. No new behavior is 
introduced, so existing coverage applies.
   
   ## Are there any user-facing changes?
   
   No. This is an internal performance optimization with identical results.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

