Re: [PR] Hash join buffering on probe side [datafusion]

via GitHub Tue, 13 Jan 2026 05:45:10 -0800


gabotechs commented on PR #19761:
URL: https://github.com/apache/datafusion/pull/19761#issuecomment-3744418572


   > When this is implemented, we might want to look at 
hash_join_single_partition_threshold and 
hash_join_single_partition_threshold_rows again which could be reduced to make 
most joins run fully in parallel.
   
   I do expect buffering to have a positive impact even if all the optimizations 
you mentioned are shipped. Buffering has a much greater impact in real 
scenarios, where the IO component is far heavier because data might be stored 
in a bucket or behind a remote resource like an API. I was actually surprised 
to see a non-negligible impact even when running benchmarks against local files.
   
   Regardless of the order of events, this PR still needs work: it should not 
cause slowdowns in any of the current benchmarks.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


