Re: [PR] Hash join buffering on probe side [datafusion]

via GitHub Tue, 13 Jan 2026 03:44:23 -0800


Dandandan commented on PR #19761:
URL: https://github.com/apache/datafusion/pull/19761#issuecomment-3743876310


   So in summary: more than 10% faster on average, but with some slowdowns.
   
   * I am wondering what the speedup would be if we just loaded a single batch 
per partition instead of buffering based on size. Perhaps the speedup is 
similar, with reduced overhead?
   * While I am excited about the speedup, I wonder if this is mostly 
"hiding" some "issues" with our current join implementation: hashing and 
concatenation are currently single-threaded for CollectLeft, and both could be 
parallelized. We could also lower the overhead of the parallel hash join and 
reduce the threshold, so that more of the total query runs in parallel and 
less time is spent waiting on a single step that does not parallelize well.
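
To illustrate the point about parallelizing the build-side hashing, here is a minimal sketch using only the Rust standard library. It is not DataFusion's code: `hash_key` and `hash_batches_parallel` are hypothetical helpers, plain `Vec<i64>` stands in for Arrow record batches, and one scoped thread per batch is an assumption made for brevity rather than a realistic scheduling strategy.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};
use std::thread;

/// Hash one join key. `DefaultHasher::new()` uses fixed keys, so the same
/// key produces the same hash on every thread.
fn hash_key(key: &i64) -> u64 {
    let mut hasher = DefaultHasher::new();
    key.hash(&mut hasher);
    hasher.finish()
}

/// Hypothetical helper: hash each build-side batch on its own scoped thread
/// and return the per-batch hashes in the original batch order.
fn hash_batches_parallel(batches: &[Vec<i64>]) -> Vec<Vec<u64>> {
    thread::scope(|s| {
        // Spawn one worker per batch; each worker only reads its own batch.
        let handles: Vec<_> = batches
            .iter()
            .map(|batch| s.spawn(move || batch.iter().map(hash_key).collect::<Vec<u64>>()))
            .collect();

        // Join in spawn order so the output lines up with the input batches.
        handles
            .into_iter()
            .map(|h| h.join().expect("hashing thread panicked"))
            .collect()
    })
}
```

Calling `hash_batches_parallel(&batches)` yields the same hashes as a serial loop over the batches; only the hashing work is spread across threads. Concatenation and hash table construction would still need their own parallel strategy.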


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

