Re: [I] Improve performance of high cardinality grouping by reusing hash values [datafusion]

via GitHub Thu, 05 Sep 2024 07:54:31 -0700


Rachelint commented on issue #11680:
URL: https://github.com/apache/datafusion/issues/11680#issuecomment-2331929376


   > > I guess the reason why performance improved may be similar to the partial 
skipping
   > 
   > Yes, that is why I experimented with single mode, forcing every query to 
skip the partial and repartition stages. Sadly, this doesn't work well for the 
low cardinality case.
   > 
   > > We would probably need to consolidate Aggregate(Partial and Final) and 
Repartition into a single place in order to be able to adaptively choose 
aggregate mode/algorithm based on runtime statistics.
   > 
   > I agree, similar to my idea before.
   > 
   > > An alternative idea for improvement is to combine partial group + 
repartition + final group into one operation. We could probably avoid converting 
to rows once again in the final group.
   > 
   > However, the refactor is quite challenging
   
   For aggregation, this could be used to merge the partial aggregation results 
in parallel during the final aggregation.
   As far as I know, DuckDB seems to use a partitioned hash table for a similar 
mechanism?
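
A rough sketch of the partitioned merge idea, only to make the discussion concrete. It is not DataFusion's or DuckDB's actual implementation, and every name in it (`PartialEntry`, `partial_aggregate`, `parallel_final_aggregate`) is hypothetical. It assumes a single string group key and a `SUM` aggregate. Each partial result keeps the hash it computed and uses it to route the entry to a partition, so the final stage can merge every partition on its own thread without coordination. Note that `std::collections::HashMap` still hashes keys internally, so the stored hash is only reused for routing here.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};
use std::thread;

/// Hypothetical partial-aggregate entry: the group hash is computed once and kept.
#[derive(Clone, Debug)]
struct PartialEntry {
    hash: u64,
    key: String,
    sum: i64,
}

fn hash_key(key: &str) -> u64 {
    let mut hasher = DefaultHasher::new();
    key.hash(&mut hasher);
    hasher.finish()
}

/// Partial stage: aggregate one input chunk, then split the groups into
/// `num_partitions` buckets using the hash stored in each entry.
fn partial_aggregate(rows: &[(&str, i64)], num_partitions: usize) -> Vec<Vec<PartialEntry>> {
    let mut table: HashMap<String, PartialEntry> = HashMap::new();
    for &(key, value) in rows {
        table
            .entry(key.to_string())
            .or_insert_with(|| PartialEntry { hash: hash_key(key), key: key.to_string(), sum: 0 })
            .sum += value;
    }
    let mut partitions = vec![Vec::new(); num_partitions];
    for entry in table.into_values() {
        // Routing reuses the stored hash; the key is not hashed again here.
        let p = (entry.hash % num_partitions as u64) as usize;
        partitions[p].push(entry);
    }
    partitions
}

/// Merge all partial buckets that belong to one partition.
fn merge_partition(inputs: Vec<Vec<PartialEntry>>) -> HashMap<String, i64> {
    let mut merged = HashMap::new();
    for entry in inputs.into_iter().flatten() {
        *merged.entry(entry.key).or_insert(0) += entry.sum;
    }
    merged
}

/// Final stage: partition `p` of every partial output goes to worker `p`.
/// Partitions hold disjoint keys, so the workers never touch the same group.
fn parallel_final_aggregate(
    partials: Vec<Vec<Vec<PartialEntry>>>,
    num_partitions: usize,
) -> HashMap<String, i64> {
    let mut per_partition: Vec<Vec<Vec<PartialEntry>>> = vec![Vec::new(); num_partitions];
    for partial in partials {
        for (p, bucket) in partial.into_iter().enumerate() {
            per_partition[p].push(bucket);
        }
    }
    let handles: Vec<_> = per_partition
        .into_iter()
        .map(|inputs| thread::spawn(move || merge_partition(inputs)))
        .collect();
    let mut result = HashMap::new();
    for handle in handles {
        // Plain `extend` is safe because no key appears in two partitions.
        result.extend(handle.join().expect("merge worker panicked"));
    }
    result
}
```

Calling `partial_aggregate` once per input chunk and passing the collected outputs to `parallel_final_aggregate` with the same `num_partitions` gives one merged map. A real engine would work on columnar batches and avoid the per-key `String` allocations, which this sketch ignores.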
   


