jayzhan211 commented on issue #11680:
URL: https://github.com/apache/datafusion/issues/11680#issuecomment-2264680940

   Difference between #11762 and main
   
   #11762 runs Repartition -> SingleMode group by
   `main` runs Partial group by-> Repartition -> Final group by
   
   For high cardinality (2M rows unique values)
   #11762 repartitioned to 12, divided equally, each has around ~166000 rows to 
do single group by (same with final step)
   `main` run partial aggregate first, there are 256 partition with ~8000 rows 
each. 256 * 8000 is ~2M.
   
   #11762 has less partition, and thus speed up a lot
   
   For low cardinality (2M rows with only 4 values)
   #11762 has 4 partitions, each 0.5M rows
   `main` run partial aggregate and got 4 partition after repartition, only 1 
row to process in final step
   
   I can see #11762 is slightly slower than `main` for low cardinality case but 
the regression is negligible compare to high cardinality case.
   
   
   Next, I want to move repartition within single mode group by. I guess we can 
see on par result for low cardinality case
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org

Reply via email to