jonathanc-n commented on PR #17197: URL: https://github.com/apache/datafusion/pull/17197#issuecomment-3190624755
In the case of hash join spilling this might be a bit difficult. I'm planning on putting out a proposal for hash join spilling in the next few days.

To give you a quick rundown: the idea is essentially to partition the data, and when we need to spill, write some of the partitions to disk. Once the first hash join execution is done, the spilled batches are read back from disk and the join runs again on them.

Since we can't use all of the batches at once to compute the bounds upfront, I was thinking we could just keep a running min/max as we load in the batches. But this would give up the efficient min/max computations we get with min_batch/max_batch (although I don't know how much faster they actually are).

For this reason I think the first hash join spilling pull request won't include support for filter pushdown. Alternatively, we can discuss this once I open the proposal.
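To make the running min/max idea concrete, here is a minimal sketch in plain Rust. It is not DataFusion code: `RunningBounds`, `update`, and `bounds_over_batches` are hypothetical names, and a batch is modeled as a slice of nullable `i64` join keys rather than an Arrow array. It only illustrates that bounds can be folded in batch by batch, so they stay correct when some batches are read back from disk later.

```rust
/// Hypothetical accumulator (not a DataFusion type): tracks the min and max
/// of a join key across batches that arrive one at a time.
#[derive(Debug, Default, Clone, Copy, PartialEq)]
struct RunningBounds {
    /// `None` until at least one non-null key has been seen.
    bounds: Option<(i64, i64)>,
}

impl RunningBounds {
    /// Fold one batch of nullable key values into the running bounds.
    /// Null keys are skipped, so an all-null batch leaves the bounds unchanged.
    fn update(&mut self, batch: &[Option<i64>]) {
        for v in batch.iter().flatten() {
            self.bounds = Some(match self.bounds {
                Some((lo, hi)) => (lo.min(*v), hi.max(*v)),
                None => (*v, *v),
            });
        }
    }
}

/// Assumed usage: feed batches in whatever order they become available,
/// for example in-memory partitions first and spilled partitions afterwards.
fn bounds_over_batches(batches: &[Vec<Option<i64>>]) -> Option<(i64, i64)> {
    let mut acc = RunningBounds::default();
    for batch in batches {
        acc.update(batch);
    }
    acc.bounds
}
```

One possible design choice, again only an assumption about how this could be wired up: the per-batch step does not have to be a scalar loop. Each batch could be reduced to a single (min, max) pair with a vectorized kernel, and only that pair merged into the running bounds, which may keep most of the benefit of the batch-level computation.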