Re: [PR] Making stream joins extensible: A new Trait implementation for SHJ [arrow-datafusion]

via GitHub Sun, 19 Nov 2023 23:56:39 -0800


metesynnada commented on code in PR #8234:
URL: https://github.com/apache/arrow-datafusion/pull/8234#discussion_r1398773665



##########
datafusion/physical-plan/src/joins/utils.rs:
##########
@@ -50,8 +52,135 @@ use datafusion_physical_expr::{
 
 use futures::future::{BoxFuture, Shared};
 use futures::{ready, FutureExt};
+use hashbrown::raw::RawTable;
 use parking_lot::Mutex;
 
+/// Maps a `u64` hash value based on the build side ["on" values] to a list of 
indices with this key's value.
+///
+/// By allocating a `HashMap` with capacity for *at least* the number of rows 
for entries at the build side,
+/// we make sure that we don't have to re-hash the hashmap, which needs access 
to the key (the hash in this case) value.
+///
+/// E.g. 1 -> [3, 6, 8] indicates that the column values map to rows 3, 6 and 
8 for hash value 1
+/// As the key is a hash value, we need to check possible hash collisions in 
the probe stage
+/// During this stage it might be the case that a row is contained the same 
hashmap value,
+/// but the values don't match. Those are checked in the 
[`equal_rows_arr`](crate::joins::hash_join::equal_rows_arr) method.
+///
+/// The indices (values) are stored in a separate chained list stored in the 
`Vec<u64>`.
+///
+/// The first value (+1) is stored in the hashmap, whereas the next value is 
stored in array at the position value.
+///
+/// The chain can be followed until the value "0" has been reached, meaning 
the end of the list.
+/// Also see chapter 5.3 of [Balancing vectorized query execution with 
bandwidth-optimized 
storage](https://dare.uva.nl/search?identifier=5ccbb60a-38b8-4eeb-858a-e7735dd37487)
+///
+/// # Example
+///
+/// ``` text
+/// See the example below:
+///
+/// Insert (10,1)            <-- insert hash value 10 with row index 1
+/// map:
+/// ----------
+/// | 10 | 2 |
+/// ----------
+/// next:
+/// ---------------------
+/// | 0 | 0 | 0 | 0 | 0 |
+/// ---------------------
+/// Insert (20,2)
+/// map:
+/// ----------
+/// | 10 | 2 |
+/// | 20 | 3 |
+/// ----------
+/// next:
+/// ---------------------
+/// | 0 | 0 | 0 | 0 | 0 |
+/// ---------------------
+/// Insert (10,3)           <-- collision! row index 3 has a hash value of 10 
as well
+/// map:
+/// ----------
+/// | 10 | 4 |
+/// | 20 | 3 |
+/// ----------
+/// next:
+/// ---------------------
+/// | 0 | 0 | 0 | 2 | 0 |  <--- hash value 10 maps to 4,2 (which means indices 
values 3,1)
+/// ---------------------
+/// Insert (10,4)          <-- another collision! row index 4 ALSO has a hash 
value of 10
+/// map:
+/// ---------
+/// | 10 | 5 |
+/// | 20 | 3 |
+/// ---------
+/// next:
+/// ---------------------
+/// | 0 | 0 | 0 | 2 | 4 | <--- hash value 10 maps to 5,4,2 (which means 
indices values 4,3,1)
+/// ---------------------
+/// ```
+pub struct JoinHashMap {

Review Comment:
   It is only struct that hash join uses different from other joins, this is 
why I put it under utils. I can create a new folder as well.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Re: [PR] Making stream joins extensible: A new Trait implementation for SHJ [arrow-datafusion]

Reply via email to