neilconway opened a new issue, #22854:
URL: https://github.com/apache/datafusion/issues/22854

   ### Is your feature request related to a problem or challenge?
   
   Suppose we have A semijoin B, and the optimizer chooses to implement that as 
as a `RightSemi` join. That means we build B and stream A, doing lookups 
against the B hash table. The current implementation does the following:
   
   1. For a given A value, we lookup all the matching B values, walking the 
hash chain and checking for equality.
   2. For each match, produce a `(probe_idx, build_idx)` pair.
   3. Apply the join filter if there's a non-equijoin filter
   4. Remove duplicates from the list of all `probe_idx` values
   5. Materialize the output `RecordBatch` using the distinct `probe_idx` values
   
   In part we do this because we share code with other join implementations, 
but some of this work is redundant for `RightSemi` joins (and `RightAnti` joins 
as well):
   
   1. We can stop once we hit the first matching B value, rather than walking 
the rest of the chain
   2. We never need `build_idx` values
   3. We don't need to remove duplicates from the `probe_idx` array, since we 
never produced them in the first place
   
   We could go further and specialize the hash table for semijoins: we don't 
need to keep duplicate build-side values in the hash table in the first place, 
the `visited` bitmap, and so on. But just optimizing the join algorithm itself 
should get us most of the win.
   
   ### Describe the solution you'd like
   
   _No response_
   
   ### Describe alternatives you've considered
   
   _No response_
   
   ### Additional context
   
   _No response_


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to