Dandandan commented on issue #17494:
URL: https://github.com/apache/datafusion/issues/17494#issuecomment-3581011421

   Thanks! Besides optimizing the join order at planning time or 
dynamically (I think a couple of issues already cover that), we can look at 
what makes the operator slow in more challenging scenarios.
   
   A few optimizations come to mind that might improve the current hash join 
operator in certain scenarios while keeping the same algorithm:
   
   * Reuse the allocation of the `Vec` indices between calls. This probably 
helps when the number of matching indices is low (compared to the batch size).
   * (Related): Keep building matching indices until `limit` rows have been 
reached and use `interleave` to collect the batches. That probably makes the 
operator more cache efficient, as all accesses to the map / chain happen 
together, before any batches are produced. It also avoids the overhead of 
`CoalesceBatches`.
   * Instead of building indices for the right side, we can build a boolean 
mask / filter to mark match / no match. This reduces memory usage (somewhat), 
and a boolean filter is much faster for low selectivity (i.e. most of the 
right side matches). We should then use the coalesce kernel to produce the 
right side arrays.
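
   To make the three ideas above a bit more concrete, here is a minimal sketch in plain Rust (standard library only, not DataFusion code). `MatchAccumulator` and `probe_mask` are hypothetical names, the build side is modeled as a `HashMap<i64, Vec<u32>>` rather than the real map / chain structure, and the `(batch, row)` pairs only mirror the index shape that arrow's `interleave` kernel takes. The mask variant additionally assumes each probe row matches at most one build row.

```rust
use std::collections::HashMap;

/// Hypothetical accumulator: gathers matches across several probe batches
/// until at least `limit` output rows are pending.
struct MatchAccumulator {
    limit: usize,
    /// Row index into the build side for each pending output row.
    build_rows: Vec<u32>,
    /// (probe batch number, row in that batch) for each pending output row.
    probe_rows: Vec<(usize, usize)>,
}

impl MatchAccumulator {
    fn new(limit: usize) -> Self {
        Self {
            limit,
            build_rows: Vec::with_capacity(limit),
            probe_rows: Vec::with_capacity(limit),
        }
    }

    /// Probes one batch and appends its matches. Returns true once at least
    /// `limit` rows are pending (it may overshoot within a single batch).
    fn probe(
        &mut self,
        build: &HashMap<i64, Vec<u32>>,
        batch_no: usize,
        keys: &[i64],
    ) -> bool {
        for (row, key) in keys.iter().enumerate() {
            if let Some(matches) = build.get(key) {
                for &b in matches {
                    self.build_rows.push(b);
                    self.probe_rows.push((batch_no, row));
                }
            }
        }
        self.build_rows.len() >= self.limit
    }

    /// Call after the output batch has been materialized. `clear` keeps the
    /// capacity, so the next round reuses the same allocations.
    fn reset(&mut self) {
        self.build_rows.clear();
        self.probe_rows.clear();
    }
}

/// Mask variant: one bool per probe row instead of an index per match.
/// Only equivalent to indices when a probe row matches at most one build row.
fn probe_mask(build: &HashMap<i64, Vec<u32>>, keys: &[i64]) -> Vec<bool> {
    keys.iter().map(|k| build.contains_key(k)).collect()
}
```

   In this sketch the caller would keep calling `probe` on incoming batches until it returns `true`, gather the output columns from the pending indices, and then call `reset`. Whether this actually beats the current per-batch approach would need benchmarking.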
   
   I opened https://github.com/apache/datafusion/issues/18939 to explore 
using a different algorithm (radix hash joins), which should additionally 
improve the performance of our join operators by making the algorithm more 
cache efficient.
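
   For readers unfamiliar with the idea, the partitioning step of a radix hash join can be sketched roughly as follows. This is only an illustration under my own assumptions, not the design proposed in the linked issue: `hash_key`, `partition_of` and `radix_partition` are hypothetical helpers, keys are plain `u64`, and the hash is a simple multiplicative one. Both join sides would be partitioned with the same function so that matching keys land in the same partition, and each partition can then be joined with a hash table small enough to stay in cache.

```rust
/// Hypothetical multiplicative hash; any reasonable hash function would do.
fn hash_key(key: u64) -> u64 {
    key.wrapping_mul(0x9E37_79B9_7F4A_7C15) >> 32
}

/// Partition number for `key` when using `bits` radix bits
/// (this sketch assumes `bits` is well below 32).
fn partition_of(key: u64, bits: u32) -> usize {
    let mask = (1u64 << bits) - 1;
    (hash_key(key) & mask) as usize
}

/// Splits `keys` into 2^bits partitions in two passes: a histogram pass to
/// size each partition exactly, then a scatter pass.
fn radix_partition(keys: &[u64], bits: u32) -> Vec<Vec<u64>> {
    let n = 1usize << bits;

    // Pass 1: count how many keys fall into each partition.
    let mut counts = vec![0usize; n];
    for &k in keys {
        counts[partition_of(k, bits)] += 1;
    }

    // Pass 2: allocate each partition once and scatter the keys.
    let mut parts: Vec<Vec<u64>> =
        counts.iter().map(|&c| Vec::with_capacity(c)).collect();
    for &k in keys {
        parts[partition_of(k, bits)].push(k);
    }
    parts
}
```

   A real implementation would scatter whole rows (or row indices) rather than bare keys, and would pick the number of radix bits based on cache size.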


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

