[I] [EPIC]: Perfect Hash Join [datafusion]

via GitHub Wed, 17 Sep 2025 16:26:07 -0700


jonathanc-n opened a new issue, #17635:
URL: https://github.com/apache/datafusion/issues/17635


   ### Is your feature request related to a problem or challenge?
   
   A perfect hash join is a highly specialized form of a hash join that 
eliminates a lot of runtime overhead. It can be thought of as a hybrid between 
a hash join and a direct-lookup array join. 
   
   Here are the conditions for this join and the reasons for why they are needed
   
   Single Join Key
   - Perfect hash join only works when there is exactly one join key per row.
   - Multiple keys would require building composite keys - cannot index into
   
   Integer Key
   - Keys must be integers (or convertible to contiguous integer ranges) 
   - Allows for direct addressing: output[key] = value.
   
   Non-integer keys (strings, UUIDs) would require hashing, which introduces 
collisions and destroys the "perfect" mapping.
   
   Distinct Values on Build Side
   - The build side must have unique keys (no duplicates).
    - This ensures a 1:1 mapping from key → value, allowing direct indexing 
without needing to chain multiple rows in a bucket.
   
   Inner Join
    - Perfect hash join works best with inner joins because you only need to 
store matches.
   - Outer joins would require materializing nulls for missing keys, which 
reintroduces complexity 
   
   With perfect hash join:
   - determine the minimum and maximum key on the build side.
    - Allocate a contiguous array of size max_key - min_key + 1.
    - Store each build row at index key - min_key in the array.
   
   during probe phase: 
   - For each probe key, check if it falls in [min_key, max_key].
   - If yes, directly index into the array to find the corresponding row.
   - If no, skip (no match).
   
   ### Describe the solution you'd like
   
    - Add hash join benchmarks for testing 
    - Use existing accumulator to calculate min/max values
    - Pass flag to `HashJoinExec` to specify if perfect hash join execution is 
needed
        - Can check distinct values from the hash table or use existing 
statistics. 
   
   ## Implement perfect hash join execution.
    - Add stream implementation
   
   ### Describe alternatives you've considered
   
   _No response_
   
   ### Additional context
   
   DuckDB's implementation can be seen here:
   https://github.com/duckdb/duckdb/pull/1959


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[I] [EPIC]: Perfect Hash Join [datafusion]

Reply via email to