neilconway opened a new issue, #22854: URL: https://github.com/apache/datafusion/issues/22854
### Is your feature request related to a problem or challenge? Suppose we have A semijoin B, and the optimizer chooses to implement that as as a `RightSemi` join. That means we build B and stream A, doing lookups against the B hash table. The current implementation does the following: 1. For a given A value, we lookup all the matching B values, walking the hash chain and checking for equality. 2. For each match, produce a `(probe_idx, build_idx)` pair. 3. Apply the join filter if there's a non-equijoin filter 4. Remove duplicates from the list of all `probe_idx` values 5. Materialize the output `RecordBatch` using the distinct `probe_idx` values In part we do this because we share code with other join implementations, but some of this work is redundant for `RightSemi` joins (and `RightAnti` joins as well): 1. We can stop once we hit the first matching B value, rather than walking the rest of the chain 2. We never need `build_idx` values 3. We don't need to remove duplicates from the `probe_idx` array, since we never produced them in the first place We could go further and specialize the hash table for semijoins: we don't need to keep duplicate build-side values in the hash table in the first place, the `visited` bitmap, and so on. But just optimizing the join algorithm itself should get us most of the win. ### Describe the solution you'd like _No response_ ### Describe alternatives you've considered _No response_ ### Additional context _No response_ -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
