Dandandan opened a new issue, #18942:
URL: https://github.com/apache/datafusion/issues/18942

   > Thanks! Besides optimizing the join order at planning time or 
dynamically (I think there are a couple of issues covering that), we can look at 
what makes the operator slow in more challenging scenarios.
   > 
   > Some optimizations come to mind that might improve the current hash 
join operator in certain scenarios, while keeping the same algorithm:
   > 
   > * Reuse the allocation of the `Vec` indices between calls. This probably 
helps when the number of matching indices is low (compared to the batch size).
   > * (Related): Keep building matching indices until `limit` rows have been 
reached and use `interleave` to collect the batches. That probably makes the 
operator more cache efficient, as all accesses to the map / chain happen 
together, before output batches are produced from the input data. It probably 
also avoids the overhead of `CoalesceBatches`.
   > * Instead of building indices for the right side, we can build a boolean 
mask / filter to mark match / no match. This reduces memory usage (somewhat), 
and a boolean filter is much faster when the join is not very selective (i.e. 
most of the right side matches). We should then use the coalesce kernel to 
produce the right side arrays.
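
The three ideas above can be sketched without Arrow types. This is a minimal illustration, not DataFusion's actual code: `BuildMap`, `MatchBuffer`, `append_batch`, `reset` and `probe_mask` are hypothetical names, plain `i64` keys stand in for hashed join columns, and the mask variant assumes each right-side row only needs a match / no match flag (for example, unique keys on the build side).

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for the build-side hash table:
/// join key -> row indices on the build (left) side.
type BuildMap = HashMap<i64, Vec<u32>>;

/// Match indices that live across calls, so the allocations are reused.
/// Probe rows are stored as (batch index, row index) pairs, the shape an
/// interleave-style gather over several buffered batches would consume.
#[derive(Default)]
struct MatchBuffer {
    build_rows: Vec<u32>,
    probe_rows: Vec<(usize, usize)>,
}

impl MatchBuffer {
    /// Probes one batch and appends its matches to the pending ones.
    /// Returns the number of pending output rows, so the caller can keep
    /// probing further batches until some `limit` is reached.
    fn append_batch(&mut self, map: &BuildMap, batch_idx: usize, probe_keys: &[i64]) -> usize {
        for (row, key) in probe_keys.iter().enumerate() {
            if let Some(build_rows) = map.get(key) {
                for &build_row in build_rows {
                    self.build_rows.push(build_row);
                    self.probe_rows.push((batch_idx, row));
                }
            }
        }
        self.build_rows.len()
    }

    /// Drops the pending matches but keeps the capacity for the next round.
    fn reset(&mut self) {
        self.build_rows.clear();
        self.probe_rows.clear();
    }
}

/// Mask variant: one flag per probe row instead of an index list.
/// The caller owns `mask`, so its allocation can be reused as well.
fn probe_mask(map: &BuildMap, probe_keys: &[i64], mask: &mut Vec<bool>) {
    mask.clear();
    mask.extend(probe_keys.iter().map(|key| map.contains_key(key)));
}
```

A caller would invoke `append_batch` for each incoming batch until the returned count reaches its row limit, gather the output from the buffered batches, and then call `reset`. The mask from `probe_mask` would instead be passed to a filter-style kernel.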
   > 
   > I opened https://github.com/apache/datafusion/issues/18939 to explore 
using a different algorithm (radix hash joins), which should additionally 
improve the performance of our join operators by making the algorithm more 
cache efficient. 
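
For readers unfamiliar with the term, the partitioning step that radix hash joins rely on can be sketched as follows. This is a generic illustration of the idea, not the design proposed in the linked issue: `radix_partition` is a hypothetical helper, and it assumes row hashes have already been computed and that `bits` is small (well below 64).

```rust
/// Splits row indices into 2^bits partitions using the low bits of each
/// row's precomputed hash. Rows that could join always land in the same
/// partition number on both sides, so each partition pair can be joined
/// with a hash table small enough to stay in cache.
fn radix_partition(hashes: &[u64], bits: u32) -> Vec<Vec<u32>> {
    let num_partitions = 1usize << bits;
    let mask = (num_partitions as u64) - 1;

    // First pass: histogram, so every partition is allocated exactly once.
    let mut counts = vec![0usize; num_partitions];
    for &hash in hashes {
        counts[(hash & mask) as usize] += 1;
    }

    // Second pass: scatter row indices into their partitions.
    let mut partitions: Vec<Vec<u32>> = counts
        .iter()
        .map(|&count| Vec::with_capacity(count))
        .collect();
    for (row, &hash) in hashes.iter().enumerate() {
        partitions[(hash & mask) as usize].push(row as u32);
    }
    partitions
}
```

With `bits = 2`, the hashes `[5, 8, 3, 12, 7]` fall into partitions 1, 0, 3, 0 and 3 respectively.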
   
    _Originally posted by @Dandandan in 
[#17494](https://github.com/apache/datafusion/issues/17494#issuecomment-3581011421)_


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

