LiaCastaneda commented on code in PR #23209:
URL: https://github.com/apache/datafusion/pull/23209#discussion_r3492796547


##########
datafusion/physical-plan/src/joins/join_hash_map.rs:
##########
@@ -491,6 +521,40 @@ pub fn contain_hashes<T>(map: &HashTable<(u64, T)>, 
hash_values: &[u64]) -> Bool
     BooleanArray::new(buffer, None)
 }
 
+/// Scans the `next` chain to detect real hash collisions (two build rows in
+/// the same bucket with different keys). `next[i]` stores `prev_row + 1`
+/// (`0` = end of chain). Checking every adjacent linked pair is sufficient:
+/// any two distinct keys in the same bucket must appear as neighbors 
somewhere.
+///
+/// Example — keys `["cat", "cat", "dog"]`, next `[0, 1, 2]`:
+///   row 1 → prev 0: "cat"=="cat" ✓
+///   row 2 → prev 1: "dog"!="cat" → return true (collision found)
+fn detect_key_collisions<T>(

Review Comment:
   I checked in the other engine (Trino) and there when a build row is 
inserted, Trino only inserts rows that share the same key, so by construction, 
every chain is already "pure" and hash collisions are not possible.
   This happens because Trino uses a hash table that resolves key equality at 
insert time. DataFusion's `update_from_iter` only receives hashes and row 
indices  (it has no access to key values) so it chains all rows in the same 
hash bucket together regardless of whether they share a key or not.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to