Re: [PR] Fast path for joins with distinct values in build side [datafusion]

via GitHub Sat, 24 May 2025 01:57:40 -0700


Dandandan commented on PR #16153:
URL: https://github.com/apache/datafusion/pull/16153#issuecomment-2906647265


   > This optimization is neat and already covers the common case of joins on 
primary keys. I think we can further optimize the join hash table - even for 
cases where _some_ keys might have chains. Instead of looking for a 0 value in 
the `next` vector, we can encode whether there is a next value in the top bit 
of the current slot - thus saving a lookup in the `next` vector on every probe 
that has at least a single match.
   > 
   > I don't know how well this plays with the streaming join hash map though =)
   
   That sounds like a neat thing to try! Another (smaller) optimization I can 
think of is to encode the hashmap and `next` list with `u32` indices / offsets 
where possible (halving the data so it fits more easily in CPU cache).
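
To make the two ideas above concrete, here is a minimal sketch that combines them: row indices are stored as `u32`, and the top bit of each slot records whether the chain continues. It is an illustration under stated assumptions, not DataFusion's actual join hash map. The names `ChainedIndex`, `HAS_NEXT`, `INDEX_MASK`, `build`, and `probe` are hypothetical, and a std `HashMap` keyed directly on the join key stands in for the real hash table.

```rust
use std::collections::HashMap;

// Hypothetical layout: the top bit of a slot flags "this row has a
// predecessor with the same key"; the low 31 bits hold the row index.
const HAS_NEXT: u32 = 1 << 31;
const INDEX_MASK: u32 = !HAS_NEXT;

struct ChainedIndex {
    // key -> tagged index of the most recently inserted row with that key
    heads: HashMap<u64, u32>,
    // next[row] = tagged index of the previous row with the same key;
    // only meaningful when the slot pointing at `row` has HAS_NEXT set
    next: Vec<u32>,
}

impl ChainedIndex {
    fn build(keys: &[u64]) -> Self {
        // Row indices must fit in 31 bits so the top bit stays free.
        assert!(keys.len() <= INDEX_MASK as usize);
        let mut heads: HashMap<u64, u32> = HashMap::new();
        let mut next = vec![0u32; keys.len()];
        for (row, &key) in keys.iter().enumerate() {
            let row = row as u32;
            let tagged = match heads.get(&key) {
                // Key seen before: link to the old head and set the flag.
                Some(&prev) => {
                    next[row as usize] = prev;
                    row | HAS_NEXT
                }
                // First row for this key: no chain, flag stays clear.
                None => row,
            };
            heads.insert(key, tagged);
        }
        Self { heads, next }
    }

    // Returns matching build-side rows, most recently inserted first.
    fn probe(&self, key: u64) -> Vec<u32> {
        let mut out = Vec::new();
        let mut slot = match self.heads.get(&key) {
            Some(&s) => s,
            None => return out,
        };
        loop {
            let row = slot & INDEX_MASK;
            out.push(row);
            // For the last row in a chain (or a unique key) the flag is
            // clear, so `next` is never read for that row.
            if (slot & HAS_NEXT) == 0 {
                break;
            }
            slot = self.next[row as usize];
        }
        out
    }
}
```

With build-side keys `[10, 20, 10, 30, 10]`, probing key `10` walks rows 4, 2, 0, while probing the unique key `20` returns row 1 without touching `next` at all. The trade-off in this sketch is that the index range shrinks to 31 bits, so a build side with more rows would need a wider fallback.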


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


