Dandandan commented on code in PR #6679:
URL: https://github.com/apache/arrow-datafusion/pull/6679#discussion_r1233948258
##########
datafusion/core/src/physical_plan/joins/hash_join.rs:
##########
@@ -563,16 +563,24 @@ pub fn update_hash(
// insert hashes to key of the hashmap
for (row, hash_value) in hash_values.iter().enumerate() {
let item = hash_map
- .0
+ .map
.get_mut(*hash_value, |(hash, _)| *hash_value == *hash);
- if let Some((_, indices)) = item {
- indices.push((row + offset) as u64);
+ if let Some((_, index)) = item {
+ // Already exists: add index to next array
+ let prev_index = *index;
+ // Store new value inside hashmap
+ *index = (row + offset + 1) as u64;
Review Comment:
I think the reason is that, while iterating over the hashes/indices, we get
the latest index in constant time, and that index both identifies the value
**and** points to the previous index in the chain. I'm not sure how it would
work if the map held the chain start instead, as we would first have to
iterate to reach the last entry?
It would be possible (though it does not seem beneficial for the normal hash
join) to also keep the start of the chain in the hashmap.
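
To make the constant-time argument concrete, here is a minimal sketch of the
"latest index in the map, previous index in a `next` array" layout. It is not
the code from this PR: `ChainedMap` and `matching_rows` are hypothetical names,
`std::collections::HashMap` stands in for the raw hash table used in
`hash_join.rs`, and treating `0` as the end-of-chain marker is an assumption
inferred from the `+ 1` in the diff.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for the join hash map discussed above.
struct ChainedMap {
    /// hash -> (latest row index + 1); 0 is assumed to mean "end of chain".
    map: HashMap<u64, u64>,
    /// next[i] = (previous row index with the same hash + 1), or 0 if none.
    next: Vec<u64>,
}

impl ChainedMap {
    fn new(num_rows: usize) -> Self {
        Self {
            map: HashMap::new(),
            next: vec![0; num_rows],
        }
    }

    /// Insert one batch of hashes. Each insert is O(1): the map is
    /// overwritten with the newest index and the old one moves into `next`.
    fn update_hash(&mut self, hash_values: &[u64], offset: usize) {
        for (row, hash_value) in hash_values.iter().enumerate() {
            let new_index = (row + offset + 1) as u64;
            // `insert` returns the previously stored index, if any.
            let prev_index = self.map.insert(*hash_value, new_index).unwrap_or(0);
            self.next[row + offset] = prev_index;
        }
    }

    /// Walk the chain from the latest row back to the first one.
    fn matching_rows(&self, hash_value: u64) -> Vec<u64> {
        let mut rows = Vec::new();
        let mut index = self.map.get(&hash_value).copied().unwrap_or(0);
        while index != 0 {
            rows.push(index - 1);
            index = self.next[(index - 1) as usize];
        }
        rows
    }
}
```

With hashes `[10, 20, 10, 10]` inserted at offset 0, `matching_rows(10)`
yields `[3, 2, 0]`: newest first, with no chain traversal needed at insert
time. If the map held the chain start instead, each insert would have to walk
to the tail (or the map would need to track the tail as well).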