alamb commented on issue #11680:
URL: https://github.com/apache/datafusion/issues/11680#issuecomment-2275575262

   I think this data is very interesting and we should look more deeply into 
why is the single group mode faster than doing a repartition / aggregate.
   
   It seems like the only differences are:
   1. There is a `RepartitionExec` and `CoalesceBatchesExec`
   2. The final `AggregateExec` happens in parallel  (but on distinct subsets 
of the group)
   
   I would expect doing the final aggregate in parallel on distinct subsets to 
be about as fast
   
   So one reasonable conclusion conclusion that the overhead of 
`RepartitionExec` and `CoalesceBatchesExec` accounts for the difference 🤔 and 
this if we reduced the Repartition overhead we could see similar improvements 
as the group by single mode 
   
   
   This is the idea behind exploring  
https://github.com/apache/datafusion/pull/11647 -- I think we could avoid a 
copy at the output of CoalesceBatchesExec which would help to reduce the 
overhead


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org

Reply via email to