Rachelint commented on issue #20773:
URL: https://github.com/apache/datafusion/issues/20773#issuecomment-4017011592

   As I see, these two articles actually talk about how `duckdb` improves performance in high cardinality groups aggregation.
   
   But `datafusion` and `duckdb` take quite different approaches to aggregation today, and the methods in `duckdb` seem unlikely to help much with further improving `high cardinality groups aggregation` in datafusion.
   
   First, let's look at how they differ in `high cardinality groups aggregation`.
   
   # Duckdb 0.7
   https://duckdb.org/2022/03/07/aggregate-hashtable#parallel-aggregation
   This article describes the overall approach used to improve the performance of high cardinality groups aggregation in duckdb 0.7.
   
   And I noticed some inaccuracies in it: it actually uses a very different approach from the one described in [Leis et al](https://15721.courses.cs.cmu.edu/spring2016/papers/p743-leis.pdf).
   
   - In the low cardinality case (hashtable length in `partial aggr` <= 10K)
     - use a single hashtable in `partial aggr`
     - use a single thread to perform the `final aggr` (merging the hashtables from the `partial aggr`s)
   
   - In the high cardinality case (hashtable length in `partial aggr` > 10K)
     - use partitioned hashtables in `partial aggr`
     - use multiple threads to perform the `final aggr` in parallel (like the `final aggr` in datafusion)
   
   Effectively, it gives up the cache efficiency of the hashtable in `partial aggr`, and relies on the parallel `final aggr` to improve performance.
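   The threshold switch described above can be sketched like this (a minimal Rust sketch; the 10K threshold comes from the article, but the partition count, types, and SUM-style aggregate are illustrative assumptions, not duckdb's actual code):

```rust
use std::collections::HashMap;

// Illustrative numbers; only the 10K threshold comes from the article.
const PARTITION_THRESHOLD: usize = 10_000;
const NUM_PARTITIONS: usize = 4;

// Partial-aggregation state: one table while cardinality is low,
// hash-partitioned tables once it crosses the threshold.
enum PartialAggState {
    Single(HashMap<u64, i64>),
    Partitioned(Vec<HashMap<u64, i64>>),
}

impl PartialAggState {
    fn new() -> Self {
        PartialAggState::Single(HashMap::new())
    }

    // Accumulate a SUM-style aggregate for `key`.
    fn insert(&mut self, key: u64, value: i64) {
        match self {
            PartialAggState::Single(table) => {
                *table.entry(key).or_insert(0) += value;
                if table.len() > PARTITION_THRESHOLD {
                    // Cardinality got too high: repartition by hash so the
                    // final stage can merge each partition on its own thread.
                    let mut parts: Vec<HashMap<u64, i64>> =
                        vec![HashMap::new(); NUM_PARTITIONS];
                    for (k, v) in table.drain() {
                        parts[(k as usize) % NUM_PARTITIONS].insert(k, v);
                    }
                    *self = PartialAggState::Partitioned(parts);
                }
            }
            PartialAggState::Partitioned(parts) => {
                let idx = (key as usize) % NUM_PARTITIONS;
                *parts[idx].entry(key).or_insert(0) += value;
            }
        }
    }

    fn num_tables(&self) -> usize {
        match self {
            PartialAggState::Single(_) => 1,
            PartialAggState::Partitioned(parts) => parts.len(),
        }
    }

    fn total_groups(&self) -> usize {
        match self {
            PartialAggState::Single(t) => t.len(),
            PartialAggState::Partitioned(parts) => {
                parts.iter().map(|t| t.len()).sum()
            }
        }
    }
}
```

   Giving up the single cache-resident table is the price paid for letting `final aggr` merge the partitions on independent threads.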
   
   # Duckdb now
   https://db.in.tum.de/~leis/papers/morsels.pdf
   Duckdb currently uses a method similar to the one in this paper.
   
   - In `partial aggr`, it always keeps a very small, fixed-size hashtable
   - When the hashtable in `partial aggr` grows too big, it converts the data to partitions and resets the hashtable
   - In `final aggr`, like datafusion, it always partitions and merges in parallel
   
   We have tried a similar approach in datafusion before (see https://github.com/apache/datafusion/issues/6937#issuecomment-1681310199), but found no obvious improvement.
   I guess this is due to the execution model of datafusion (pull-based and async, which makes it hard to stay cache-efficient).
   
   # Datafusion
   datafusion uses `skip partial aggregation` to improve `high cardinality groups aggregation`, like `doris`.
   
   - In `partial aggr`, we perform the group aggregation logic as long as the threshold has not been exceeded
   - Once the threshold is exceeded, we skip the `partial aggr` entirely
   - So I guess little obvious improvement can be gained from optimizing `partial aggr` in `datafusion`, because we skip it entirely in `high cardinality groups aggregation`