jonathanc-n opened a new issue, #17635:
URL: https://github.com/apache/datafusion/issues/17635
### Is your feature request related to a problem or challenge?
A perfect hash join is a highly specialized form of a hash join that
eliminates a lot of runtime overhead. It can be thought of as a hybrid between
a hash join and a direct-lookup array join.
Here are the conditions for this join and the reasons they are needed:
Single Join Key
- Perfect hash join only works when there is exactly one join key per row.
- Multiple keys would require building composite keys, which cannot be used
to index directly into an array.
Integer Key
- Keys must be integers (or convertible to contiguous integer ranges).
- This allows for direct addressing: `output[key] = value`.
- Non-integer keys (strings, UUIDs) would require hashing, which introduces
collisions and destroys the "perfect" mapping.
Distinct Values on Build Side
- The build side must have unique keys (no duplicates).
- This ensures a 1:1 mapping from key → value, allowing direct indexing
without needing to chain multiple rows in a bucket.
Inner Join
- Perfect hash join works best with inner joins because you only need to
store matches.
- Outer joins would require materializing nulls for missing keys, which
reintroduces complexity.
With a perfect hash join, the build phase works as follows:
- Determine the minimum and maximum key on the build side.
- Allocate a contiguous array of size max_key - min_key + 1.
- Store each build row at index key - min_key in the array.
During the probe phase:
- For each probe key, check if it falls in [min_key, max_key].
- If yes, directly index into the array to find the corresponding row.
- If no, skip (no match).
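
The build and probe steps above can be sketched in plain Rust. This is only an illustration, not DataFusion code: `PerfectHashTable`, `build`, `probe`, and `inner_join` are hypothetical names, keys are assumed to be `i64`, and rows are represented by their index on each side.

```rust
use std::convert::TryFrom;

/// Hypothetical direct-lookup table for the build side.
/// `slots[key - min_key]` holds the build-side row index for that key.
struct PerfectHashTable {
    min_key: i64,
    max_key: i64,
    slots: Vec<Option<usize>>,
}

impl PerfectHashTable {
    /// Builds the table. Returns `None` if the input is empty, contains a
    /// duplicate key, or the key range does not fit in `usize`.
    fn build(build_keys: &[i64]) -> Option<Self> {
        let min_key = *build_keys.iter().min()?;
        let max_key = *build_keys.iter().max()?;
        // Array size is max_key - min_key + 1, computed with overflow checks.
        let len = max_key.checked_sub(min_key)?.checked_add(1)?;
        let len = usize::try_from(len).ok()?;
        let mut slots = vec![None; len];
        for (row, &key) in build_keys.iter().enumerate() {
            let idx = (key - min_key) as usize;
            if slots[idx].is_some() {
                return None; // duplicate key: the 1:1 mapping does not hold
            }
            slots[idx] = Some(row);
        }
        Some(Self { min_key, max_key, slots })
    }

    /// Returns the matching build-side row index, if any.
    fn probe(&self, key: i64) -> Option<usize> {
        if key < self.min_key || key > self.max_key {
            return None; // outside [min_key, max_key]: no match
        }
        self.slots[(key - self.min_key) as usize]
    }
}

/// Inner join over key columns: returns `(build_row, probe_row)` pairs.
fn inner_join(table: &PerfectHashTable, probe_keys: &[i64]) -> Vec<(usize, usize)> {
    probe_keys
        .iter()
        .enumerate()
        .filter_map(|(probe_row, &key)| table.probe(key).map(|build_row| (build_row, probe_row)))
        .collect()
}
```

Note that a slot inside the range can still be empty when the build keys are not dense, so the probe has to handle a missing entry even after the range check passes. The sketch also allocates one slot per value in the key range, so it is only sensible when that range is small relative to available memory.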
### Describe the solution you'd like
- Add hash join benchmarks for testing
- Use the existing accumulator to calculate min/max values
- Pass a flag to `HashJoinExec` to specify whether perfect hash join
execution is needed
- Distinct values can be checked from the hash table or from existing
statistics.
- Implement perfect hash join execution.
  - Add stream implementation
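
As a rough illustration of the min/max and distinctness checks listed above, the decision to enable the flag could look something like the following. Everything here is an assumption for the sake of the sketch: `KeyStats`, `key_stats`, `use_perfect_hash_join`, and the `max_slots` threshold are hypothetical and do not correspond to existing DataFusion APIs or statistics types.

```rust
use std::collections::HashSet;

/// Hypothetical summary of the build-side key column.
struct KeyStats {
    min: i64,
    max: i64,
    distinct: bool,
}

/// Computes min, max, and distinctness in one pass over the keys.
/// Returns `None` for an empty input.
fn key_stats(keys: &[i64]) -> Option<KeyStats> {
    let mut seen = HashSet::with_capacity(keys.len());
    let mut min = *keys.first()?;
    let mut max = min;
    let mut distinct = true;
    for &k in keys {
        min = min.min(k);
        max = max.max(k);
        // `insert` returns false when the key was already present.
        distinct &= seen.insert(k);
    }
    Some(KeyStats { min, max, distinct })
}

/// Decides whether the perfect hash join path is applicable.
/// `max_slots` is an assumed tuning knob that caps the lookup array size.
fn use_perfect_hash_join(stats: &KeyStats, max_slots: u64) -> bool {
    // Widen to i128 so that max - min + 1 cannot overflow.
    let range = (stats.max as i128) - (stats.min as i128) + 1;
    stats.distinct && range <= max_slots as i128
}
```

In a real planner the same information might come from existing column statistics rather than a scan of the keys; the scan here just keeps the example self-contained.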
### Describe alternatives you've considered
_No response_
### Additional context
DuckDB's implementation can be seen here:
https://github.com/duckdb/duckdb/pull/1959