metesynnada commented on code in PR #6679:
URL: https://github.com/apache/arrow-datafusion/pull/6679#discussion_r1233951246
##########
datafusion/core/src/physical_plan/joins/hash_join.rs:
##########
@@ -727,12 +735,12 @@ pub fn build_equal_condition_join_indices(
         // For every item on the build and probe we check if it matches
         // This possibly contains rows with hash collisions,
         // So we have to check here whether rows are equal or not
-        if let Some((_, indices)) = build_hashmap
-            .0
+        if let Some((_, index)) = build_hashmap
+            .map
             .get(*hash_value, |(hash, _)| *hash_value == *hash)
         {
-            for &i in indices {
-                // Check hash collisions
+            let mut i = *index - 1;
+            loop {
                 let offset_build_index = i as usize - offset_value;

Review Comment:
   Offset is not necessary for the usual hash join; you can remove it safely.


##########
datafusion/core/src/physical_plan/joins/hash_join.rs:
##########
@@ -563,16 +563,24 @@ pub fn update_hash(
     // insert hashes to key of the hashmap
     for (row, hash_value) in hash_values.iter().enumerate() {
         let item = hash_map
-            .0
+            .map
            .get_mut(*hash_value, |(hash, _)| *hash_value == *hash);
-        if let Some((_, indices)) = item {
-            indices.push((row + offset) as u64);
+        if let Some((_, index)) = item {
+            // Already exists: add index to next array
+            let prev_index = *index;
+            // Store new value inside hashmap
+            *index = (row + offset + 1) as u64;

Review Comment:
   Yeah, there is no gain for the usual hash join, but pruning becomes much more expensive if I do not have the beginning. I think I will not push for it; for now, let's have separate hashmap paradigms.
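The `map`/`next` layout discussed in the two hunks above can be hard to picture from the diff alone, so here is a minimal, self-contained sketch of the same chained-list idea. `ChainedJoinMap`, `insert`, and `matches` are illustrative names only (not the DataFusion API), and a plain `std::collections::HashMap` stands in for the `RawTable` keyed by precomputed hashes that the actual code uses:

```rust
use std::collections::HashMap;

/// Illustrative stand-in for the chained-list join map: `map` stores
/// `hash -> (index of the last inserted row + 1)`, and `next[row]` stores the
/// previous row's (index + 1) for the same hash, or 0 to mark the end of the
/// chain. The real code keys a `hashbrown::RawTable` by the precomputed hash;
/// a plain `HashMap<u64, u64>` is used here only to keep the sketch
/// self-contained.
struct ChainedJoinMap {
    map: HashMap<u64, u64>,
    next: Vec<u64>,
}

impl ChainedJoinMap {
    fn with_capacity(num_rows: usize) -> Self {
        Self {
            map: HashMap::with_capacity(num_rows),
            next: vec![0; num_rows],
        }
    }

    /// Build side (cf. `update_hash`): record that `row` has this hash value.
    fn insert(&mut self, hash: u64, row: usize) {
        // The previous head of the chain (or 0 if the hash is new) becomes the
        // "next" pointer of the freshly inserted row.
        let prev = self.map.insert(hash, (row + 1) as u64).unwrap_or(0);
        self.next[row] = prev;
    }

    /// Probe side (cf. `build_equal_condition_join_indices`): walk the chain
    /// and collect every build-side row index stored under this hash.
    fn matches(&self, hash: u64) -> Vec<usize> {
        let mut rows = Vec::new();
        if let Some(&head) = self.map.get(&hash) {
            let mut i = (head - 1) as usize;
            loop {
                rows.push(i);
                match self.next[i] {
                    0 => break,                // 0 terminates the chain
                    n => i = (n - 1) as usize, // stored values are index + 1
                }
            }
        }
        rows
    }
}

fn main() {
    let mut m = ChainedJoinMap::with_capacity(9);
    for row in [3usize, 6, 8] {
        m.insert(1, row);
    }
    // The chain is walked from the most recently inserted row backwards.
    assert_eq!(m.matches(1), vec![8, 6, 3]);
    assert!(m.matches(42).is_empty());
}
```

Note that walking the chain yields rows in reverse insertion order, e.g. `[8, 6, 3]` for the `1 -> [3, 6, 8]` example in the `JoinHashMap` doc comment below.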
##########
datafusion/core/src/physical_plan/joins/hash_join_utils.rs:
##########
@@ -36,24 +36,47 @@ use crate::physical_plan::joins::utils::{JoinFilter, JoinSide};
 use datafusion_common::Result;
 
 // Maps a `u64` hash value based on the build side ["on" values] to a list of indices with this key's value.
-//
-// Note that the `u64` keys are not stored in the hashmap (hence the `()` as key), but are only used
-// to put the indices in a certain bucket.
 // By allocating a `HashMap` with capacity for *at least* the number of rows for entries at the build side,
 // we make sure that we don't have to re-hash the hashmap, which needs access to the key (the hash in this case) value.
 // E.g. 1 -> [3, 6, 8] indicates that the column values map to rows 3, 6 and 8 for hash value 1
 // As the key is a hash value, we need to check possible hash collisions in the probe stage
 // During this stage it might be the case that a row is contained the same hashmap value,
 // but the values don't match. Those are checked in the [equal_rows] macro
-// TODO: speed up collision check and move away from using a hashbrown HashMap
+// The indices (values) are stored in a separate chained list stored in the `Vec<u64>`.
+// The first value (+1) is stored in the hashmap, whereas the next value is stored in array at the position value.
+// The chain can be followed until the value "0" has been reached, meaning the end of the list.
+// Also see chapter 5.3 of [Balancing vectorized query execution with bandwidth-optimized storage](https://dare.uva.nl/search?identifier=5ccbb60a-38b8-4eeb-858a-e7735dd37487)
+// TODO: speed up collision checks
 // https://github.com/apache/arrow-datafusion/issues/50
-pub struct JoinHashMap(pub RawTable<(u64, SmallVec<[u64; 1]>)>);
+pub struct JoinHashMap {
+    // Stores hash value to first index
+    pub map: RawTable<(u64, u64)>,
+    // Stores indices in chained list data structure
+    pub next: Vec<u64>,
+}
+
+/// SymmetricJoinHashMap is similar to JoinHashMap, except that it stores the indices inline, allowing it to mutate
+/// and shrink the indices.
+pub struct SymmetricJoinHashMap(pub RawTable<(u64, SmallVec<[u64; 1]>)>);

Review Comment:
   I might change this since we are not pushing for the same hash table implementation.
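Since the pruning point came up above, a rough sketch of why the inline layout is kept for the symmetric join: with the indices stored inline they can be shrunk in place when old rows are pruned. `SymmetricMapSketch` and `prune` are made-up names for illustration, and `HashMap<u64, Vec<u64>>` stands in for the `RawTable<(u64, SmallVec<[u64; 1]>)>` that `SymmetricJoinHashMap` actually wraps:

```rust
use std::collections::HashMap;

/// Illustrative stand-in for the SmallVec-based symmetric map: all build-side
/// row indices for a hash live inline in the value, so an entry can be mutated
/// and shrunk in place. The real `SymmetricJoinHashMap` wraps a
/// `hashbrown::RawTable<(u64, SmallVec<[u64; 1]>)>`; `HashMap<u64, Vec<u64>>`
/// is used here only to keep the sketch dependency-free.
struct SymmetricMapSketch {
    map: HashMap<u64, Vec<u64>>,
}

impl SymmetricMapSketch {
    fn new() -> Self {
        Self { map: HashMap::new() }
    }

    /// Record that `row` has this hash value.
    fn insert(&mut self, hash: u64, row: u64) {
        self.map.entry(hash).or_default().push(row);
    }

    /// Pruning for the symmetric (streaming) case: drop every index below
    /// `prune_before` and remove buckets that become empty. This stays cheap
    /// because the indices are stored inline and can be shrunk in place.
    fn prune(&mut self, prune_before: u64) {
        self.map.retain(|_, rows| {
            rows.retain(|&r| r >= prune_before);
            !rows.is_empty()
        });
    }
}

fn main() {
    let mut m = SymmetricMapSketch::new();
    for (hash, row) in [(1u64, 3u64), (1, 6), (2, 4), (1, 8)] {
        m.insert(hash, row);
    }
    m.prune(5);
    assert_eq!(m.map.get(&1), Some(&vec![6u64, 8]));
    assert!(m.map.get(&2).is_none());
}
```

With the forward-chained layout shown earlier, pruning would instead require rewriting chain heads and `next` entries, which seems to be the extra cost alluded to in the comment about keeping separate hashmap paradigms.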