Re: [PR] Fix HashJoinExec sideways information passing for partitioned queries [datafusion]

via GitHub Thu, 14 Aug 2025 22:06:53 -0700


jonathanc-n commented on PR #17197:
URL: https://github.com/apache/datafusion/pull/17197#issuecomment-3190624755


   In the case of hash join spilling this might be a bit difficult. I'm 
planning to put out a proposal for hash join spilling in the next few days. 
As a quick rundown, the idea is to partition the data and, when we need to 
spill, write some partitions to disk. After the first hash join exec pass is 
done, it will read the spilled batches back from disk and run again. 
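
A rough sketch of the partition-then-spill idea, ahead of the actual proposal. 
Everything here is an assumption made for illustration: the names `Partition`, 
`partition_of` and `partition_with_budget` are hypothetical, the row-count 
budget stands in for real memory accounting, and a `Vec` stands in for the 
on-disk data. It is not DataFusion code and may not match the eventual design.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Where a partition's build-side keys currently live.
/// Hypothetical type: a real operator would hold a spill file in `Spilled`;
/// here a Vec stands in for the data written to disk.
enum Partition {
    InMemory(Vec<i64>),
    Spilled(Vec<i64>),
}

/// Map a join key to a partition index. Assumes `num_partitions > 0`.
fn partition_of(key: i64, num_partitions: usize) -> usize {
    let mut hasher = DefaultHasher::new();
    key.hash(&mut hasher);
    (hasher.finish() % num_partitions as u64) as usize
}

/// Hash-partition the keys, then keep partitions in memory while they fit in
/// the row budget and mark the remaining ones as spilled.
fn partition_with_budget(
    keys: &[i64],
    num_partitions: usize,
    max_rows_in_memory: usize,
) -> Vec<Partition> {
    let mut buckets: Vec<Vec<i64>> = vec![Vec::new(); num_partitions];
    for &key in keys {
        buckets[partition_of(key, num_partitions)].push(key);
    }

    let mut rows_in_memory = 0;
    buckets
        .into_iter()
        .map(|rows| {
            if rows_in_memory + rows.len() <= max_rows_in_memory {
                rows_in_memory += rows.len();
                Partition::InMemory(rows)
            } else {
                // Would be written to disk and read back in a later pass.
                Partition::Spilled(rows)
            }
        })
        .collect()
}
```

In this sketch the first pass would join only the `InMemory` partitions, and 
a later pass would reload each `Spilled` partition and run the join again.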
   
   Since we can't use the entire batch to compute the bounds up front, I was 
thinking we could just keep a running min/max as we load in the batches. But 
this would lose the efficient min/max computations we get from 
min_batch/max_batch (though I don't know how much faster they actually are). 
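
A minimal sketch of the running min/max idea, assuming a single `i64` join 
key and plain slices in place of Arrow batches. `RunningBounds` and 
`update_bounds` are hypothetical names used only for illustration; this does 
not use min_batch/max_batch and is not how the PR computes bounds.

```rust
/// Running bounds over one join-key column, updated batch by batch.
/// Hypothetical type for illustration; not a DataFusion API.
#[derive(Debug, Clone, Copy, PartialEq)]
struct RunningBounds {
    min: i64,
    max: i64,
}

/// Fold one batch of key values into the running bounds.
/// `None` means no values have been seen so far.
fn update_bounds(current: Option<RunningBounds>, batch: &[i64]) -> Option<RunningBounds> {
    let (Some(lo), Some(hi)) = (batch.iter().copied().min(), batch.iter().copied().max()) else {
        // An empty batch leaves the bounds unchanged.
        return current;
    };
    Some(match current {
        Some(b) => RunningBounds {
            min: b.min.min(lo),
            max: b.max.max(hi),
        },
        None => RunningBounds { min: lo, max: hi },
    })
}
```

Each batch is scanned once as it is loaded, so the bounds are available 
without ever holding all batches at the same time.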
   
   For this reason, I think the first hash join spilling pull request won't 
include support for filter pushdown. Alternatively, we can discuss this once 
I open the proposal.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

