SubhamSinghal opened a new pull request, #21775: URL: https://github.com/apache/datafusion/pull/21775
## Which issue does this PR close?
- Closes #.
## Rationale for this change
For RightSemi/RightAnti hash joins, the probe phase only needs to know
whether **at least one** matching build row exists per probe row. However, the
current code:
1. Traverses the **entire** collision chain, emitting every `(probe,
build)` pair
2. Runs `equal_rows_arr` on **all** pairs (`take()` + `eq_dyn_null()`,
allocating intermediate arrays)
3. Deduplicates via `get_semi_indices` to keep just the first match per
probe row
For skewed data with N duplicate build rows per key, this produces N pairs
per probe row only to keep 1 — O(N) wasted work in both chain traversal and
equality verification.
## What changes are included in this PR?
**New `get_first_match` method on `JoinHashMapType` trait**
(`join_hash_map.rs`):
- Walks the collision chain for a hash value, stops at the first entry
where a caller-provided predicate returns true
- Returns `Option<u64>` — the matching build row index, or `None`
- Shared implementation in `get_first_match_impl`, reused by
`JoinHashMapU32`, `JoinHashMapU64`, and `PruningJoinHashMap`
**New `process_probe_batch_right_semi_anti` function** (`stream.rs`):
- For each probe row, calls `get_first_match` with an equality predicate
built from `JoinKeyComparator::is_equal` (zero-allocation per-row comparison)
- Collects matched probe indices (RightSemi) or unmatched probe indices
(RightAnti)
- Builds output via existing `build_batch_from_indices`
## Are these changes tested?
Covered by existing tests:
- 791 hash join unit tests (all pass)
- 14 SLT join test files (all pass)
- TPC-H SF1 benchmark: no regressions, Q4/Q21 (semi/anti patterns) show
minor improvements
No new tests added — this is a pure optimization that produces identical
results. The existing test suite comprehensively covers RightSemi/RightAnti
join.
## Are there any user-facing changes?
No.
## TPC-H Benchmark result:
| Query | main (ms) | new (ms) | Change |
|-------|-----------|----------|--------|
| Q1 | 2.96 | 2.89 | ~same |
| Q2 | 3.12 | 2.80 | ~same |
| Q3 | 1.70 | 1.77 | ~same |
| Q4 | 1.23 | 1.15 | -7% |
| Q5 | 2.11 | 2.04 | ~same |
| Q6 | 0.69 | 0.75 | ~same |
| Q7 | 2.58 | 2.51 | ~same |
| Q8 | 2.94 | 2.91 | ~same |
| Q9 | 2.19 | 2.03 | ~same |
| Q10 | 2.20 | 2.15 | ~same |
| Q11 | 1.95 | 1.91 | ~same |
| Q12 | 1.38 | 1.36 | ~same |
| Q13 | 1.23 | 1.10 | -11% |
| Q14 | 1.18 | 1.12 | ~same |
| Q15 | 2.28 | 2.22 | ~same |
| Q16 | 1.76 | 1.70 | ~same |
| Q17 | 1.37 | 1.26 | ~same |
| Q18 | 1.61 | 1.45 | -10% |
| Q19 | 2.28 | 2.05 | -10% |
| Q20 | 2.01 | 1.81 | -10% |
| Q21 | 2.72 | 2.23 | **-18%** |
| Q22 | 1.89 | 1.75 | ~same |
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
