alamb opened a new pull request, #11647:
URL: https://github.com/apache/datafusion/pull/11647

   ## Which issue does this PR close?
   
   Related to https://github.com/apache/datafusion/issues/7957 and 
https://github.com/apache/datafusion/issues/11628
   
   ## Rationale for this change
   
   As described in https://github.com/apache/datafusion/issues/7957 and 
https://github.com/apache/datafusion/issues/11628, the current combination of 
filtering / repartitioning followed by coalescing requires copying the data twice. 
This PR is a prototype to:
   1. See how much performance improves if the two operations are combined 
into one, avoiding a copy
   2. Figure out how big a change it would be / what the code would look like
   
   This is based on the code in https://github.com/apache/datafusion/pull/11610 
and a bunch of discussion with @XiangpengHao  @edmondop @2010YOUY01 and others
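   
   To illustrate the double copy described above, here is a minimal sketch 
that uses plain `Vec<i64>` values in place of Arrow `RecordBatch`es. It is not 
the code in this PR: the `Batch` alias and the functions 
`filter_then_coalesce` and `filter_into_coalescer` are hypothetical names 
invented for this example. The first function copies each surviving row twice 
(once when filtering, once when coalescing); the second writes surviving rows 
straight into the output buffer.

```rust
/// A "batch" is modeled as a plain Vec<i64>. Real DataFusion batches are
/// Arrow RecordBatches; everything here is a hypothetical stand-in.
type Batch = Vec<i64>;

/// Two-step approach: filter each batch (copy #1), then concatenate the
/// small filtered batches into batches of up to `target` rows (copy #2).
fn filter_then_coalesce(input: &[Batch], target: usize, pred: impl Fn(i64) -> bool) -> Vec<Batch> {
    // Copy #1: materialize one (possibly tiny) filtered batch per input batch.
    let filtered: Vec<Batch> = input
        .iter()
        .map(|b| b.iter().copied().filter(|v| pred(*v)).collect())
        .collect();

    // Copy #2: copy the filtered rows again into target-sized batches.
    let mut out = Vec::new();
    let mut buf: Batch = Vec::with_capacity(target);
    for b in &filtered {
        for &v in b {
            buf.push(v);
            if buf.len() == target {
                out.push(std::mem::replace(&mut buf, Vec::with_capacity(target)));
            }
        }
    }
    if !buf.is_empty() {
        out.push(buf);
    }
    out
}

/// Fused approach: rows that pass the filter are written straight into the
/// output buffer, so each surviving row is copied only once.
fn filter_into_coalescer(input: &[Batch], target: usize, pred: impl Fn(i64) -> bool) -> Vec<Batch> {
    let mut out = Vec::new();
    let mut buf: Batch = Vec::with_capacity(target);
    for b in input {
        for &v in b {
            if pred(v) {
                buf.push(v);
                if buf.len() == target {
                    out.push(std::mem::replace(&mut buf, Vec::with_capacity(target)));
                }
            }
        }
    }
    // Emit whatever is left as a final, smaller batch.
    if !buf.is_empty() {
        out.push(buf);
    }
    out
}

fn main() {
    let input: Vec<Batch> = vec![(0..10).collect(), (10..20).collect(), (20..30).collect()];
    let is_multiple_of_3 = |v: i64| v % 3 == 0;
    let two_step = filter_then_coalesce(&input, 4, is_multiple_of_3);
    let fused = filter_into_coalescer(&input, 4, is_multiple_of_3);
    // Both approaches produce the same batches; only the number of copies differs.
    assert_eq!(two_step, fused);
    println!("{:?}", fused);
}
```

   Both functions return identical output, which mirrors the "results are the 
same" check in the plan; any speedup would come only from skipping the 
intermediate filtered batches.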
   
   ## Plan
   
   The theory is that non-trivial time is spent in coalesce batches and 
repartitioning, and that we could improve performance by several seconds (almost 1s 
of CPU in several queries). See analysis below
   
   My high level plan is to implement enough of this idea to run some 
ClickBench queries like Q20, Q15, and Q16 and TPCH Q8 and see how they perform. 
If the results are promising, I will work to scope out how to turn this into 
real PRs
   
   High level plan:
   - [x] Integrate the BatchCoalescer into `FilterExec`
   - [ ] Integrate the BatchCoalescer into `RepartitionExec`
   - [ ] Test to make sure the results are the same
   - [ ] Test to make sure it doesn't slow things down
   - [ ] Test to make sure the CoalesceBatchesExec doesn't do any work now 
(work is shifted to the FilterExec and RepartitionExec)
   - [ ] Implement some special case coalesce batches for filter
   - [ ] Implement some special case coalesce batches for repartition (`take`)
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

