LiaCastaneda commented on code in PR #23209:
URL: https://github.com/apache/datafusion/pull/23209#discussion_r3492811285


##########
datafusion/physical-plan/src/joins/join_hash_map.rs:
##########
@@ -491,6 +521,40 @@ pub fn contain_hashes<T>(map: &HashTable<(u64, T)>, 
hash_values: &[u64]) -> Bool
     BooleanArray::new(buffer, None)
 }
 
+/// Scans the `next` chain to detect real hash collisions (two build rows in
+/// the same bucket with different keys). `next[i]` stores `prev_row + 1`
+/// (`0` = end of chain). Checking every adjacent linked pair is sufficient:
+/// any two distinct keys in the same bucket must appear as neighbors 
somewhere.
+///
+/// Example — keys `["cat", "cat", "dog"]`, next `[0, 1, 2]`:
+///   row 1 → prev 0: "cat"=="cat" ✓
+///   row 2 → prev 1: "dog"!="cat" → return true (collision found)
+fn detect_key_collisions<T>(

Review Comment:
   Note doing the same in DF would be non trivial because it would require 
modifying `update_from_iter` trait signature to accept key columns alongside 
the hashes, and updating all call sites accordingly.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to