Re: [I] Improve performance of high cardinality grouping by reusing hash values [datafusion]

via GitHub Thu, 08 Aug 2024 04:20:06 -0700


alamb commented on issue #11680:
URL: https://github.com/apache/datafusion/issues/11680#issuecomment-2275575262


   I think this data is very interesting, and we should look more deeply into 
why the single group mode is faster than doing a repartition / aggregate.
   
   It seems like the only differences are:
   1. There is a `RepartitionExec` and `CoalesceBatchesExec`
   2. The final `AggregateExec` happens in parallel (but on distinct subsets 
of the groups)
   
   I would expect doing the final aggregate in parallel on distinct subsets to 
be about as fast.
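
   As a rough illustration of the two shapes being compared (a standard-library 
sketch only, not DataFusion's implementation): `aggregate_single` builds one hash 
table over all rows, while `aggregate_repartitioned` first hash-partitions the keys 
into disjoint subsets and aggregates each subset independently. The function names 
and the `COUNT(*)`-style aggregate are assumptions made for this example.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};

/// Single group mode (sketch): one hash table over all rows, counting rows per key.
fn aggregate_single(rows: &[u64]) -> HashMap<u64, u64> {
    let mut counts = HashMap::new();
    for &key in rows {
        *counts.entry(key).or_insert(0) += 1;
    }
    counts
}

/// Stand-in for a hash repartition: route each key to one of `n` partitions
/// by its hash, so every distinct key lands in exactly one partition.
fn repartition(rows: &[u64], n: usize) -> Vec<Vec<u64>> {
    let mut parts = vec![Vec::new(); n];
    for &key in rows {
        let mut hasher = DefaultHasher::new();
        key.hash(&mut hasher);
        // This copy into a per-partition buffer is extra work that the
        // single-table path does not pay.
        parts[(hasher.finish() % n as u64) as usize].push(key);
    }
    parts
}

/// Repartition / aggregate (sketch): aggregate each partition on its own.
/// The loop is sequential here; a real engine would run partitions in parallel.
fn aggregate_repartitioned(rows: &[u64], n: usize) -> HashMap<u64, u64> {
    let mut merged = HashMap::new();
    for part in repartition(rows, n) {
        // Partitions hold disjoint key sets, so no counts need to be combined.
        merged.extend(aggregate_single(&part));
    }
    merged
}
```

   Both functions return the same counts for any input; the difference is the 
extra hashing and copying in `repartition`. Note that each key is hashed once 
for routing and again inside the per-partition hash table.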
   
   So one reasonable conclusion is that the overhead of 
`RepartitionExec` and `CoalesceBatchesExec` accounts for the difference 🤔 and 
that if we reduced the Repartition overhead we could see similar improvements 
to the group by single mode.
   
   
   This is the idea behind exploring 
https://github.com/apache/datafusion/pull/11647 -- I think we could avoid a 
copy at the output of `CoalesceBatchesExec`, which would help to reduce the 
overhead.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

