alamb opened a new pull request, #11647: URL: https://github.com/apache/datafusion/pull/11647
## Which issue does this PR close?

Related to https://github.com/apache/datafusion/issues/7957 and https://github.com/apache/datafusion/issues/11628

## Rationale for this change

As described on https://github.com/apache/datafusion/issues/7957 and https://github.com/apache/datafusion/issues/11628, the current combination of filtering / repartitioning followed by coalescing requires copying the data twice.

This PR is a prototype to:
1. See how much better performance would be if we combined the two operations into one and avoided a copy
2. Figure out how big a change it would be / what the code would look like

This is based on the code in https://github.com/apache/datafusion/pull/11610 and a bunch of discussion with @XiangpengHao @edmondop @2010YOUY01 and others.

## Plan

The theory is that a non-trivial amount of time is spent in coalescing batches and repartitioning, and that avoiding the extra copy could improve performance by several seconds (almost 1s of CPU in several queries); see the analysis below.

My high level plan is to implement enough of this idea to run some ClickBench queries like Q20, Q15 and Q16, and TPCH Q8, and measure the effect. If the results are promising, I will work to scope out how to make this into real PRs.

High level plan:
- [x] Integrate the BatchCoalescer into `FilterExec`
- [ ] Integrate the BatchCoalescer into `RepartitionExec`
- [ ] Test to make sure the results are the same
- [ ] Test to make sure it doesn't slow things down
- [ ] Test to make sure the `CoalesceBatchesExec` doesn't do any work now (work is shifted to the `FilterExec` and `RepartitionExec`)
- [ ] Implement some special case coalesce batches for filter
- [ ] Implement some special case coalesce batches for repartition (`take`)
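
To illustrate the "copy twice vs. copy once" idea from the rationale, here is a minimal, self-contained sketch. It is not the code in this PR and does not use the Arrow or DataFusion APIs: `Batch` is a toy single-column stand-in for a record batch, and `FilterCoalescer`, `push_filtered`, and `filter_then_coalesce` are hypothetical names chosen for illustration. It only shows that writing the selected rows directly into the output buffer halves the number of row copies while producing the same output batches.

```rust
/// Toy stand-in for a columnar batch: one column of i64 values (an assumption,
/// not Arrow's `RecordBatch`).
type Batch = Vec<i64>;

/// Baseline: filter each batch into a small batch (copy 1), then concatenate
/// the small batches into batches of `target_rows` (copy 2).
fn filter_then_coalesce(
    batches: &[Batch],
    masks: &[Vec<bool>],
    target_rows: usize,
) -> (Vec<Batch>, usize) {
    let mut rows_copied = 0;
    let mut filtered: Vec<Batch> = Vec::new();
    for (batch, mask) in batches.iter().zip(masks) {
        let mut small = Batch::new();
        for (value, keep) in batch.iter().zip(mask) {
            if *keep {
                small.push(*value); // copy 1
                rows_copied += 1;
            }
        }
        filtered.push(small);
    }
    let mut completed = Vec::new();
    let mut in_progress = Batch::with_capacity(target_rows);
    for small in &filtered {
        for value in small {
            in_progress.push(*value); // copy 2
            rows_copied += 1;
            if in_progress.len() == target_rows {
                let full = std::mem::replace(&mut in_progress, Batch::with_capacity(target_rows));
                completed.push(full);
            }
        }
    }
    if !in_progress.is_empty() {
        completed.push(in_progress);
    }
    (completed, rows_copied)
}

/// Hypothetical fused filter + coalesce buffer (illustrative only).
struct FilterCoalescer {
    target_rows: usize,
    in_progress: Batch,
    completed: Vec<Batch>,
    rows_copied: usize,
}

impl FilterCoalescer {
    fn new(target_rows: usize) -> Self {
        assert!(target_rows > 0);
        Self {
            target_rows,
            in_progress: Batch::with_capacity(target_rows),
            completed: Vec::new(),
            rows_copied: 0,
        }
    }

    /// Copy the selected rows of `batch` straight into the output buffer,
    /// emitting a completed batch whenever `target_rows` is reached.
    fn push_filtered(&mut self, batch: &[i64], mask: &[bool]) {
        assert_eq!(batch.len(), mask.len());
        for (value, keep) in batch.iter().zip(mask) {
            if *keep {
                self.in_progress.push(*value); // the only copy
                self.rows_copied += 1;
                if self.in_progress.len() == self.target_rows {
                    let fresh = Batch::with_capacity(self.target_rows);
                    let full = std::mem::replace(&mut self.in_progress, fresh);
                    self.completed.push(full);
                }
            }
        }
    }

    /// Flush any partially filled batch and return (batches, rows copied).
    fn finish(mut self) -> (Vec<Batch>, usize) {
        if !self.in_progress.is_empty() {
            let rest = std::mem::take(&mut self.in_progress);
            self.completed.push(rest);
        }
        (self.completed, self.rows_copied)
    }
}

fn main() {
    // Three input batches of 4 rows; the filter keeps even values.
    let batches: Vec<Batch> = vec![vec![1, 2, 3, 4], vec![5, 6, 7, 8], vec![9, 10, 11, 12]];
    let masks: Vec<Vec<bool>> = batches
        .iter()
        .map(|b| b.iter().map(|v| v % 2 == 0).collect())
        .collect();

    let (two_step, two_step_copies) = filter_then_coalesce(&batches, &masks, 4);

    let mut fused = FilterCoalescer::new(4);
    for (batch, mask) in batches.iter().zip(&masks) {
        fused.push_filtered(batch, mask);
    }
    let (fused_out, fused_copies) = fused.finish();

    assert_eq!(two_step, fused_out);
    println!("two-step row copies: {two_step_copies}, fused row copies: {fused_copies}");
}
```

In this toy setup the two paths produce identical batches, but the fused path copies each surviving row once instead of twice. Whether that translates into the hoped-for savings on real Arrow arrays is exactly what the prototype is meant to measure.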
