Dandandan commented on code in PR #23209:
URL: https://github.com/apache/datafusion/pull/23209#discussion_r3501209906
##########
datafusion/physical-plan/src/joins/join_hash_map.rs:
##########
@@ -491,6 +521,40 @@ pub fn contain_hashes<T>(map: &HashTable<(u64, T)>,
hash_values: &[u64]) -> Bool
BooleanArray::new(buffer, None)
}
+/// Scans the `next` chain to detect real hash collisions (two build rows in
+/// the same bucket with different keys). `next[i]` stores `prev_row + 1`
+/// (`0` = end of chain). Checking every adjacent linked pair is sufficient:
+/// any two distinct keys in the same bucket must appear as neighbors
somewhere.
+///
+/// Example — keys `["cat", "cat", "dog"]`, next `[0, 1, 2]`:
+/// row 1 → prev 0: "cat"=="cat" ✓
+/// row 2 → prev 1: "dog"!="cat" → return true (collision found)
+fn detect_key_collisions<T>(
Review Comment:
Yes I think this could be faster at least for single column keys / with type
/ join type specialization.
The reason I think we / I pursued this design is that `take` is (much)
faster than individual dispatch (and most recent literature also seems to
implement it in a separate pass) and probably easier to implement all the join
types.
Also combining it with filter is a bit easier with the indices -> filter
indices -> collision check, etc.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]