Dandandan opened a new pull request, #22827: URL: https://github.com/apache/datafusion/pull/22827
## Which issue does this PR close?

N/A.

## Rationale for this change

Right semi hash joins backed by ArrayMap only need to know whether each probe-side row has at least one build-side match. The previous path materialized every duplicate build-side match and then deduplicated probe indices, which is expensive for fanout-heavy keys.

On this machine, `right_semi_fanout100_h1` went from about `4.20 ms` to about `920 us`, a roughly 78% reduction.

## What changes are included in this PR?

- Add an ArrayMap lookup that emits one matching probe index per probe row.
- Use that lookup for unfiltered HashJoinExec RightSemi joins when the build map is ArrayMap.
- Add coverage for duplicate build keys and limited-offset continuation.

## Are these changes tested?

- `cargo fmt --all`
- `cargo clippy --all-targets --all-features -- -D warnings`
- `cargo test -p datafusion-physical-plan --lib test_array_map_matching_probe_indices_omits_build_duplicates`
- `cargo test -p datafusion-physical-plan --lib join_right_semi`
- `cargo bench -p datafusion-physical-plan --features test_utils --bench hash_join_semi_anti -- right_semi_fanout100_h1 --sample-size 10 --warm-up-time 1 --measurement-time 3`

## Are there any user-facing changes?

No.
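## Illustrative sketch

For readers unfamiliar with the trade-off described in the rationale, here is a minimal sketch of the two lookup shapes. It is not the code in this PR: a `HashMap<u32, Vec<usize>>` stands in for ArrayMap, and the names `build_map`, `right_semi_materialize`, and `right_semi_exists` are hypothetical. It also ignores batching and the limited-offset continuation that the real lookup has to handle.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for the build-side map: join key -> build row indices.
fn build_map(build_keys: &[u32]) -> HashMap<u32, Vec<usize>> {
    let mut map: HashMap<u32, Vec<usize>> = HashMap::new();
    for (row, key) in build_keys.iter().enumerate() {
        map.entry(*key).or_default().push(row);
    }
    map
}

/// Shape of the previous path: emit one probe index per build-side match,
/// then deduplicate. Work grows with the fanout of each key.
fn right_semi_materialize(map: &HashMap<u32, Vec<usize>>, probe_keys: &[u32]) -> Vec<usize> {
    let mut probe_indices = Vec::new();
    for (probe_row, key) in probe_keys.iter().enumerate() {
        if let Some(build_rows) = map.get(key) {
            for _build_row in build_rows {
                probe_indices.push(probe_row);
            }
        }
    }
    // Indices were pushed in ascending probe order, so removing
    // consecutive repeats is enough to deduplicate.
    probe_indices.dedup();
    probe_indices
}

/// Shape of the new path: an existence check that emits each matching
/// probe index once, regardless of how many build rows share the key.
fn right_semi_exists(map: &HashMap<u32, Vec<usize>>, probe_keys: &[u32]) -> Vec<usize> {
    let mut probe_indices = Vec::new();
    for (probe_row, key) in probe_keys.iter().enumerate() {
        if map.contains_key(key) {
            probe_indices.push(probe_row);
        }
    }
    probe_indices
}
```

Both functions return the same probe indices. The second one never touches the duplicate build rows, which is where the fanout-heavy case would be expected to save time.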
