Dandandan commented on issue #17494: URL: https://github.com/apache/datafusion/issues/17494#issuecomment-3581011421
Thanks!

Besides optimizing the join order, either at planning time or dynamically (I think there are a couple of issues covering that), we can look at what makes the operator slow in more challenging scenarios.

Some optimizations come to mind that might improve the current hash join operator in certain scenarios while keeping the same algorithm:

* Reuse the allocation of the `Vec` indices between calls. This probably helps when the number of matching indices is low (compared to the batch size).
* (Related) Keep building matching indices until `limit` rows have been reached and use `interleave` to collect the batches. That probably makes the operator more cache efficient, as all accesses to the map / chain happen together, before producing batches. It also avoids the overhead of `CoalesceBatches`, which probably helps as well.
* Instead of building indices for the right side, we can build a boolean mask / filter to mark match / no match. This reduces memory usage (somewhat), and a boolean filter is much faster for low selectivity (i.e. most of the right side matches). We should then use the coalesce kernel to produce the right side arrays.

I opened https://github.com/apache/datafusion/issues/18939 to explore using a different algorithm (radix hash joins), which should additionally improve the performance of our join operators by making the algorithm more cache efficient.
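
To make the first and third ideas a bit more concrete, here is a minimal, std-only sketch. It is not DataFusion's actual code: `Prober`, `probe_indices`, and `probe_mask` are hypothetical names, the build side is a plain `HashMap` rather than the real map / chain structure, and keys are assumed to be plain `i64` values. The mask variant additionally assumes each probe row only needs a match / no match answer (for example, unique build keys), since a boolean cannot express multiple matches per row.

```rust
use std::collections::HashMap;

/// Hypothetical probe-side helper (illustration only, not a DataFusion type).
struct Prober {
    /// Build side: join key -> build row indices holding that key.
    build: HashMap<i64, Vec<u32>>,
    /// Scratch buffers that are reused across probe calls.
    build_idx: Vec<u32>,
    probe_idx: Vec<u32>,
}

impl Prober {
    fn new(build_keys: &[i64]) -> Self {
        let mut build: HashMap<i64, Vec<u32>> = HashMap::new();
        for (row, key) in build_keys.iter().enumerate() {
            build.entry(*key).or_default().push(row as u32);
        }
        Self {
            build,
            build_idx: Vec::new(),
            probe_idx: Vec::new(),
        }
    }

    /// Index variant: collect matching (build, probe) row pairs.
    /// `clear()` keeps the capacity, so later calls avoid reallocating
    /// as long as they produce no more matches than an earlier call.
    fn probe_indices(&mut self, probe_keys: &[i64]) -> (&[u32], &[u32]) {
        self.build_idx.clear();
        self.probe_idx.clear();
        for (row, key) in probe_keys.iter().enumerate() {
            if let Some(rows) = self.build.get(key) {
                for &b in rows {
                    self.build_idx.push(b);
                    self.probe_idx.push(row as u32);
                }
            }
        }
        (self.build_idx.as_slice(), self.probe_idx.as_slice())
    }

    /// Mask variant: one boolean per probe row (match / no match).
    /// The caller owns `mask`, so it can be reused between batches too.
    fn probe_mask(&self, probe_keys: &[i64], mask: &mut Vec<bool>) {
        mask.clear();
        mask.extend(probe_keys.iter().map(|k| self.build.contains_key(k)));
    }
}
```

In a real operator the index pairs would feed a take / `interleave` style kernel and the mask a filter kernel; the sketch only shows how the two representations differ and where the buffer reuse happens.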
