2010YOUY01 commented on PR #21817: URL: https://github.com/apache/datafusion/pull/21817#issuecomment-4637581158
I think this solution is faster mainly because the current physical layout is not ideal for this access pattern, rather than because RoaringBitmap is inherently a better data structure for this workload.

My preferred long-term solution would be to implement a hash table specialized for semi/anti joins. I believe that would be not only faster but also more general, since the optimization could apply to all data types.

One related idea is to fully separate the semi/anti join path from the existing hash join implementation. I think this would make both paths better organized and potentially more performant; see the related issue: https://github.com/apache/datafusion/issues/22710

That said, I realize my optimization philosophy here is a bit greedy: if I see a better long-term solution, I tend to go directly for it and try to avoid introducing additional complexity along the way. Others may prefer to iterate through smaller incremental optimizations first, and I have no objection to that approach.
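To illustrate the specialized hash table idea, here is a minimal sketch, assuming plain in-memory keys rather than Arrow arrays. `KeySet`, `build`, `semi`, and `anti` are hypothetical names for illustration and are not DataFusion APIs. It only shows that a semi/anti join needs key membership on the build side, not the matching row indices, and that the same structure works for any hashable key type. It ignores NULL semantics and batching.

```rust
use std::collections::HashSet;
use std::hash::Hash;

/// Hypothetical build-side structure (not a DataFusion type): it keeps only
/// the distinct join keys, because semi/anti joins never emit build rows.
struct KeySet<K> {
    keys: HashSet<K>,
}

impl<K: Eq + Hash> KeySet<K> {
    /// Build phase: duplicates collapse, so memory scales with distinct keys.
    fn build(build_keys: impl IntoIterator<Item = K>) -> Self {
        Self {
            keys: build_keys.into_iter().collect(),
        }
    }

    /// Semi join: indices of probe rows whose key exists on the build side.
    fn semi(&self, probe: &[K]) -> Vec<usize> {
        probe
            .iter()
            .enumerate()
            .filter_map(|(i, k)| self.keys.contains(k).then_some(i))
            .collect()
    }

    /// Anti join: indices of probe rows whose key is absent from the build side.
    /// NULL handling is deliberately left out of this sketch.
    fn anti(&self, probe: &[K]) -> Vec<usize> {
        probe
            .iter()
            .enumerate()
            .filter_map(|(i, k)| (!self.keys.contains(k)).then_some(i))
            .collect()
    }
}
```

For example, building from the keys `["a", "b", "b", "c"]` and probing with `["b", "x", "a", "b"]` selects probe rows 0, 2, and 3 for the semi join and row 1 for the anti join. The key type is generic here, which is the sense in which such a path could apply to all data types rather than only integer keys.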
