Dandandan opened a new pull request, #22827:
URL: https://github.com/apache/datafusion/pull/22827

   ## Which issue does this PR close?
   
   N/A.
   
   ## Rationale for this change
   
   Right semi hash joins backed by ArrayMap only need to know whether each 
probe-side row has at least one build-side match. The previous path 
materialized every duplicate build-side match and then deduplicated probe 
indices, which is expensive for fanout-heavy keys.
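
The difference between the two strategies can be sketched as follows. This is a minimal, self-contained illustration and not the DataFusion implementation: `DenseMap`, `right_semi_materialize`, and `right_semi_exists` are hypothetical names, and the array-backed map is assumed to be a dense table indexed by `key - min_key`.

```rust
/// Hypothetical stand-in for an array-backed build map (not DataFusion's
/// `ArrayMap`): `slots[key - min_key]` holds the build rows with that key.
struct DenseMap {
    min_key: i64,
    slots: Vec<Vec<u32>>,
}

impl DenseMap {
    fn from_build_keys(keys: &[i64]) -> Self {
        let min_key = keys.iter().copied().min().unwrap_or(0);
        let max_key = keys.iter().copied().max().unwrap_or(0);
        let mut slots = vec![Vec::new(); (max_key - min_key + 1) as usize];
        for (row, &key) in keys.iter().enumerate() {
            slots[(key - min_key) as usize].push(row as u32);
        }
        Self { min_key, slots }
    }

    /// All build rows whose key equals `key` (empty if the key is absent).
    fn matches(&self, key: i64) -> &[u32] {
        let offset = key - self.min_key;
        if offset < 0 {
            return &[];
        }
        self.slots.get(offset as usize).map(Vec::as_slice).unwrap_or(&[])
    }
}

/// Materialize one probe index per (build, probe) match, then deduplicate.
/// Work grows with the fanout of each key.
fn right_semi_materialize(map: &DenseMap, probe_keys: &[i64]) -> Vec<u32> {
    let mut probe_indices = Vec::new();
    for (probe_row, &key) in probe_keys.iter().enumerate() {
        for _build_row in map.matches(key) {
            probe_indices.push(probe_row as u32);
        }
    }
    // Indices are produced in ascending order, so adjacent dedup is enough.
    probe_indices.dedup();
    probe_indices
}

/// Existence check only: each probe row is emitted at most once,
/// independent of how many build rows share its key.
fn right_semi_exists(map: &DenseMap, probe_keys: &[i64]) -> Vec<u32> {
    let mut probe_indices = Vec::new();
    for (probe_row, &key) in probe_keys.iter().enumerate() {
        if !map.matches(key).is_empty() {
            probe_indices.push(probe_row as u32);
        }
    }
    probe_indices
}
```

Both functions return the same probe indices; the second simply never creates the intermediate duplicates, which is where the saving for high-fanout keys would come from.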
   
   On this machine, `right_semi_fanout100_h1` went from about `4.20 ms` to 
about `920 us`, a roughly 78% reduction.
   
   ## What changes are included in this PR?
   
   - Add an ArrayMap lookup that emits each matching probe index at most once, regardless of how many build rows share the key.
   - Use that lookup for unfiltered HashJoinExec RightSemi joins when the build 
map is ArrayMap.
   - Add coverage for duplicate build keys and limited-offset continuation.
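
As a rough illustration of what "limited-offset continuation" could mean here, the sketch below scans the probe side in bounded chunks and returns the position at which the next call should resume. It is an assumption-laden example, not the PR's code: `probe_matches_resumable` and its signature are hypothetical, and a `HashSet` stands in for the build-side map.

```rust
use std::collections::HashSet;

/// Hypothetical resumable lookup (not the DataFusion API).
/// Scans `probe_keys` starting at `offset`, emits at most `limit` matching
/// probe indices, and returns `Some(next_offset)` if rows remain unscanned.
/// `limit` is assumed to be greater than zero.
fn probe_matches_resumable(
    build_keys: &HashSet<i64>,
    probe_keys: &[i64],
    limit: usize,
    offset: usize,
) -> (Vec<u32>, Option<usize>) {
    let mut out = Vec::with_capacity(limit.min(probe_keys.len()));
    for (row, key) in probe_keys.iter().enumerate().skip(offset) {
        if out.len() == limit {
            // Output is full: stop before consuming `row` so the caller
            // can resume from exactly this position.
            return (out, Some(row));
        }
        // Duplicate build keys collapse in the set, so a probe row is
        // emitted at most once no matter how large the fanout is.
        if build_keys.contains(key) {
            out.push(row as u32);
        }
    }
    (out, None)
}
```

A caller would loop, passing the returned offset back in until it receives `None`. Because each probe row contributes at most one index, the resume position only needs to track the probe row, not a position within a chain of build-side duplicates.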
   
   ## Are these changes tested?
   
   - `cargo fmt --all`
   - `cargo clippy --all-targets --all-features -- -D warnings`
   - `cargo test -p datafusion-physical-plan --lib test_array_map_matching_probe_indices_omits_build_duplicates`
   - `cargo test -p datafusion-physical-plan --lib join_right_semi`
   - `cargo bench -p datafusion-physical-plan --features test_utils --bench hash_join_semi_anti -- right_semi_fanout100_h1 --sample-size 10 --warm-up-time 1 --measurement-time 3`
   
   ## Are there any user-facing changes?
   
   No.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

