Re: [PR] Hash join buffering on probe side [datafusion]

via GitHub Tue, 13 Jan 2026 05:41:59 -0800


gabotechs commented on PR #19761:
URL: https://github.com/apache/datafusion/pull/19761#issuecomment-3744403441


   > I am wondering what the speedup would be if we just load a single batch 
per partition instead of based on size - perhaps the speedup is similar with 
reduced overhead?
   
   Tested it locally and it does not seem to have a significant impact. Even 
unbounded buffering makes no significant difference. My takeaway is that the 
main driver of the speedup is that something forces the probe side 
`RecordBatchStream`s to make progress, whether or not that involves buffering 
actual record batches.
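
   To illustrate the "something forces the probe side to make progress" idea, 
here is a minimal sketch using only the Rust standard library. It is not the 
code from this PR and does not use DataFusion's stream types: `Batch`, 
`eager_probe` and `join_row_count` are hypothetical names, and a background 
thread with a bounded channel stands in for eagerly polling a 
`RecordBatchStream`.

```rust
use std::collections::HashSet;
use std::sync::mpsc::{sync_channel, Receiver};
use std::thread::{self, JoinHandle};

/// Hypothetical stand-in for a record batch: a plain vector of join keys.
type Batch = Vec<u64>;

/// Drives `source` on a background thread so the probe side starts producing
/// immediately, instead of waiting until the consumer first asks for data.
/// At most `capacity` batches are queued in the channel at any time.
fn eager_probe<I>(source: I, capacity: usize) -> (Receiver<Batch>, JoinHandle<()>)
where
    I: Iterator<Item = Batch> + Send + 'static,
{
    let (tx, rx) = sync_channel(capacity);
    let handle = thread::spawn(move || {
        for batch in source {
            // `send` blocks while the queue is full and fails once the
            // receiver has been dropped, in which case we stop producing.
            if tx.send(batch).is_err() {
                break;
            }
        }
    });
    (rx, handle)
}

/// Simulated join: collect the build side first, then drain the probe side
/// and count the probe keys that have a match in the build side.
fn join_row_count(build: impl FnOnce() -> Vec<u64>, probe: Receiver<Batch>) -> usize {
    // While this runs, the probe thread is already working in the background.
    let build_keys: HashSet<u64> = build().into_iter().collect();
    probe
        .iter()
        .flatten()
        .filter(|key| build_keys.contains(key))
        .count()
}
```

   In this sketch the probe source is advanced as soon as `eager_probe` is 
called, regardless of how small `capacity` is, which loosely mirrors the 
observation that the amount buffered matters less than the fact that the probe 
side is being driven at all.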
   
   > While I am excited about the speedup, I wonder if this is not mostly 
"hiding" some "issues" with our current join implementation: hashing and 
concatenation are currently single-threaded for CollectLeft, and both could be 
parallelized. Also, we could lower the overhead of the parallel hash join and 
reduce the threshold to make the total query run more in parallel and avoid 
time spent waiting on a step that is not well parallelizable.
   
   This can indeed reduce the impact of any issue that slows down build side 
creation. I would not say "hiding", since the issues should still be visible; 
overall query latency is just no longer the best metric for discovering them.
   
    


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


