Dandandan opened a new issue #27:
URL: https://github.com/apache/arrow-datafusion/issues/27


   Hash partitioning is faster (up to 4x in my earlier tests) than the current 
implementation, which collects all partial aggregates into one "full" 
aggregate, for cases with very high cardinality in the aggregate keys (think 
deduplication queries). However, not using hash partitioning is faster for 
very "simple" aggregates (few distinct keys), as less work needs to be done.
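
   For illustration, here is a minimal, self-contained Rust sketch of the idea 
(not the DataFusion implementation; `NUM_PARTITIONS` and the toy rows are 
hypothetical). Rows are routed to partitions by hashing the group key, so each 
partition holds a disjoint set of keys and can compute its final aggregate 
independently, with no merge into a single "full" hash map:

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};

/// Hypothetical partition count; in practice this would come from the
/// configured concurrency.
const NUM_PARTITIONS: usize = 4;

/// Route a row to a partition by hashing its group key. Rows with the
/// same key always land in the same partition.
fn partition_for_key<K: Hash>(key: &K) -> usize {
    let mut hasher = DefaultHasher::new();
    key.hash(&mut hasher);
    (hasher.finish() as usize) % NUM_PARTITIONS
}

fn main() {
    // Toy input: (group key, value) pairs.
    let rows = vec![("a", 1), ("b", 2), ("a", 3), ("c", 4), ("b", 5)];

    // One aggregation hash map per partition instead of one "full" map.
    let mut partitions: Vec<HashMap<&str, i64>> =
        vec![HashMap::new(); NUM_PARTITIONS];

    for (key, value) in rows {
        let p = partition_for_key(&key);
        *partitions[p].entry(key).or_insert(0) += value; // SUM per key
    }

    for (i, part) in partitions.iter().enumerate() {
        println!("partition {}: {:?}", i, part);
    }
}
```

   With high-cardinality keys the per-partition maps stay small and the work 
parallelizes; with a handful of keys the extra hashing and routing is pure 
overhead, which is why the simple path can win.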
   
   We probably need some fast way to get a rough estimate of the number of 
distinct values in the aggregate keys, perhaps derived dynamically from the 
first batch(es), as sketched below.
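
   A hedged sketch of that idea: count the distinct keys in the first batch 
exactly (an approximate sketch such as HyperLogLog would also work) and pick a 
strategy from the distinct-to-total ratio. The threshold 
`HIGH_CARDINALITY_RATIO` and all names here are hypothetical, not part of 
DataFusion:

```rust
use std::collections::HashSet;

/// Hypothetical threshold: if the first batch is mostly distinct keys,
/// assume high cardinality and prefer hash-partitioned aggregation.
const HIGH_CARDINALITY_RATIO: f64 = 0.5;

#[derive(Debug)]
enum AggStrategy {
    HashPartitioned,
    SinglePartition,
}

/// Pick a strategy from a rough distinct count over the first batch.
fn choose_strategy(first_batch_keys: &[&str]) -> AggStrategy {
    if first_batch_keys.is_empty() {
        // No evidence either way; fall back to the simple path.
        return AggStrategy::SinglePartition;
    }
    let distinct: HashSet<_> = first_batch_keys.iter().collect();
    let ratio = distinct.len() as f64 / first_batch_keys.len() as f64;
    if ratio >= HIGH_CARDINALITY_RATIO {
        AggStrategy::HashPartitioned
    } else {
        AggStrategy::SinglePartition
    }
}

fn main() {
    let dedup_like = ["u1", "u2", "u3", "u4"]; // nearly all distinct
    let simple = ["x", "x", "y", "x"];         // few groups
    println!("{:?}", choose_strategy(&dedup_like)); // HashPartitioned
    println!("{:?}", choose_strategy(&simple));     // SinglePartition
}
```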
   
   This work also creates a building block for Ballista to distribute data 
across workers: the aggregation is parallelized instead of collected to one 
worker, which lets it scale to bigger datasets.
   

