Dandandan opened a new pull request, #22653:
URL: https://github.com/apache/datafusion/pull/22653

   ## Which issue does this PR close?
   
   - Not applicable.
   
   ## Rationale for this change
   
   Without a residual join filter, right semi and right anti joins only need 
to know whether each probe-side row has at least one matching build-side row. 
The generic hash join path currently materializes every duplicate build-side 
match and then deduplicates (semi) or inverts (anti) the probe indices, which 
is unnecessary work when build-side keys are heavily duplicated.
   
   In a focused, throwaway local benchmark with 10,000 duplicate build rows and 
10,000 probe rows, the old lookup path enumerated 100,000,000 candidate pairs 
(10,000 x 10,000) in 183.872 ms, while the new existence lookup returned 
10,000 probe matches in 45.791 µs, roughly 4,000x faster.
   
   ## What changes are included in this PR?
   
   - Add a hash-map existence probe that stops walking a duplicate chain after 
the first equality-confirmed match.
   - Add an ArrayMap membership probe for the same right semi/anti use case.
   - Route `RightSemi` and `RightAnti` hash joins without residual filters 
through the existence path.
   - Keep the generic path for joins with residual filters, where duplicate 
build rows may affect filter results.
   - Add a unit test covering early stop behavior for duplicate build-side 
matches.
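
The early-stop idea in the list above can be sketched as follows. This is a minimal illustration, not the code in this PR: `ChainedMap`, `probe_all_matches`, `probe_any_match`, `select_rows`, `JoinKind`, and `use_existence_path` are hypothetical names, keys are plain `i64` values, and the bucket function is deliberately weak so that collisions occur and the equality check matters.

```rust
/// Hypothetical, simplified chained hash table (not DataFusion's actual type).
/// `heads[bucket]` holds the 1-based index of the most recently inserted build
/// row in that bucket; `next[row]` links to the previous row in the same
/// bucket, and 0 terminates the chain.
struct ChainedMap {
    heads: Vec<usize>,
    next: Vec<usize>,
    build_keys: Vec<i64>,
}

/// Deliberately weak bucket function so unrelated keys can share a chain.
fn bucket_of(key: i64, buckets: usize) -> usize {
    key.rem_euclid(buckets as i64) as usize
}

impl ChainedMap {
    /// Builds the chains; `buckets` must be greater than zero.
    fn new(build_keys: Vec<i64>, buckets: usize) -> Self {
        let mut heads = vec![0; buckets];
        let mut next = vec![0; build_keys.len()];
        for (row, key) in build_keys.iter().enumerate() {
            let b = bucket_of(*key, buckets);
            next[row] = heads[b];
            heads[b] = row + 1;
        }
        Self { heads, next, build_keys }
    }

    /// Generic path: emits every (build_row, probe_row) pair with equal keys.
    fn probe_all_matches(&self, probe_keys: &[i64]) -> Vec<(usize, usize)> {
        let mut pairs = Vec::new();
        for (probe_row, key) in probe_keys.iter().enumerate() {
            let mut cur = self.heads[bucket_of(*key, self.heads.len())];
            while cur != 0 {
                let build_row = cur - 1;
                if self.build_keys[build_row] == *key {
                    pairs.push((build_row, probe_row));
                }
                cur = self.next[build_row];
            }
        }
        pairs
    }

    /// Existence path: one flag per probe row. The chain walk stops at the
    /// first equality-confirmed match instead of visiting every duplicate.
    fn probe_any_match(&self, probe_keys: &[i64]) -> Vec<bool> {
        probe_keys
            .iter()
            .map(|key| {
                let mut cur = self.heads[bucket_of(*key, self.heads.len())];
                while cur != 0 {
                    let build_row = cur - 1;
                    if self.build_keys[build_row] == *key {
                        return true; // early stop: existence is established
                    }
                    cur = self.next[build_row];
                }
                false
            })
            .collect()
    }
}

/// Right semi keeps probe rows whose flag is `true`; right anti keeps `false`.
fn select_rows(flags: &[bool], keep_matched: bool) -> Vec<usize> {
    flags
        .iter()
        .enumerate()
        .filter(|(_, matched)| **matched == keep_matched)
        .map(|(row, _)| row)
        .collect()
}

/// Hypothetical subset of join types, used only to illustrate the routing rule.
#[derive(Clone, Copy, Debug, PartialEq)]
enum JoinKind {
    Inner,
    RightSemi,
    RightAnti,
}

/// The existence path applies only when no residual filter is present.
fn use_existence_path(kind: JoinKind, has_residual_filter: bool) -> bool {
    !has_residual_filter && matches!(kind, JoinKind::RightSemi | JoinKind::RightAnti)
}
```

With build keys `[7, 7, 7, 3]` in a single chain and probe keys `[7, 3, 11, 5]`, the generic probe produces four pairs, while the existence probe produces one flag per probe row. A residual filter still needs the generic path, because a later duplicate build row might pass the filter even when the first match does not.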
   
   ## Are these changes tested?
   
   - `cargo fmt --all`
   - `cargo clippy --all-targets --all-features -- -D warnings`
   - `cargo test -p datafusion-physical-plan hash_join --lib`
   - `cargo test -p datafusion-physical-plan joins::join_hash_map::tests::test_probe_indices_with_any_match_stops_after_first_match --lib`
   
   ## Are there any user-facing changes?
   
   No API or behavior changes expected. This is a physical execution 
optimization for right semi/anti hash joins without residual filters.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

