Dandandan commented on a change in pull request #24:
URL: https://github.com/apache/arrow-datafusion/pull/24#discussion_r621193159
##########
File path: datafusion/src/physical_plan/hash_join.rs
##########
@@ -399,15 +395,23 @@ fn update_hash(
.map(|name|
Ok(col(name).evaluate(batch)?.into_array(batch.num_rows())))
.collect::<Result<Vec<_>>>()?;
- // update the hash map
+ // calculate the hash values
        let hash_values = create_hashes(&keys_values, &random_state, hashes_buffer)?;
// insert hashes to key of the hashmap
for (row, hash_value) in hash_values.iter().enumerate() {
- hash.raw_entry_mut()
- .from_key_hashed_nocheck(*hash_value, hash_value)
- .and_modify(|_, v| v.push((row + offset) as u64))
-                .or_insert_with(|| (*hash_value, smallvec![(row + offset) as u64]));
+ match hash.raw_entry_mut().from_hash(*hash_value, |_| true) {
+ hashbrown::hash_map::RawEntryMut::Occupied(mut entry) => {
+ entry.get_mut().push((row + offset) as u64);
+ }
+ hashbrown::hash_map::RawEntryMut::Vacant(entry) => {
+ entry.insert_hashed_nocheck(
Review comment:
The overall algorithm does check for collisions (values with the same hash) in a later stage. There is a unit test covering this case too.
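
To illustrate the idea of deferring collision handling, here is a minimal sketch (not the PR's actual `hashbrown` raw-entry code) using a std `HashMap` keyed by the hash value itself; `build_index` and the sample hashes are hypothetical names for illustration. Rows whose keys happen to share a hash land in the same bucket and are disambiguated later by comparing the real key values:

```rust
use std::collections::HashMap;

// Sketch: index build-side row numbers by hash value only. Collisions
// (distinct keys with the same hash) share a bucket; the join resolves
// them in a later stage by comparing the actual key values.
fn build_index(hashes: &[u64], offset: usize) -> HashMap<u64, Vec<u64>> {
    let mut index: HashMap<u64, Vec<u64>> = HashMap::new();
    for (row, hash_value) in hashes.iter().enumerate() {
        index
            .entry(*hash_value)
            .or_insert_with(Vec::new)
            .push((row + offset) as u64);
    }
    index
}

fn main() {
    // Rows 0 and 2 share hash 7: both go into the same bucket, and the
    // later key-equality check decides whether they truly match.
    let index = build_index(&[7, 42, 7], 0);
    assert_eq!(index[&7], vec![0, 2]);
    assert_eq!(index[&42], vec![1]);
    println!("ok");
}
```

The real code avoids rehashing by using `raw_entry_mut` with a precomputed hash, but the correctness argument is the same: the map only needs to group candidate rows by hash, because equality on the key columns is verified afterwards.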
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]