Re: [PR] perf(hash-join): eliminate intermediate array allocations in probe-side collision filter [datafusion]

via GitHub Tue, 30 Jun 2026 12:07:11 -0700


Dandandan commented on code in PR #23209:
URL: https://github.com/apache/datafusion/pull/23209#discussion_r3501209906



##########
datafusion/physical-plan/src/joins/join_hash_map.rs:
##########
@@ -491,6 +521,40 @@ pub fn contain_hashes<T>(map: &HashTable<(u64, T)>, 
hash_values: &[u64]) -> Bool
     BooleanArray::new(buffer, None)
 }
 
+/// Scans the `next` chain to detect real hash collisions (two build rows in
+/// the same bucket with different keys). `next[i]` stores `prev_row + 1`
+/// (`0` = end of chain). Checking every adjacent linked pair is sufficient:
+/// any two distinct keys in the same bucket must appear as neighbors 
somewhere.
+///
+/// Example — keys `["cat", "cat", "dog"]`, next `[0, 1, 2]`:
+///   row 1 → prev 0: "cat"=="cat" ✓
+///   row 2 → prev 1: "dog"!="cat" → return true (collision found)
+fn detect_key_collisions<T>(

Review Comment:
   Yes I think this could be faster at least for single column keys / with type 
/ join type specialization. 
   
   The reason I think we / I pursued this design is that `take` is (much) 
faster than individual dispatch (and most recent literature also seems to 
implement it in a separate pass) and probably easier to implement all the join 
types.
   
   Also combining it with filter is a bit easier with the indices -> filter 
indices -> collision check, etc.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: [PR] perf(hash-join): eliminate intermediate array allocations in probe-side collision filter [datafusion]

Reply via email to