jayzhan211 commented on issue #11680: URL: https://github.com/apache/datafusion/issues/11680#issuecomment-2264680940
Difference between #11762 and `main`:

#11762 runs Repartition -> Single-mode group by.
`main` runs Partial group by -> Repartition -> Final group by.

For high cardinality (2M rows, all unique values):
#11762 repartitions into 12 partitions, divided equally, so each has ~166,000 rows for the single group by (same as the final step).
`main` runs the partial aggregate first; there are 256 partitions with ~8,000 rows each (256 * 8000 is ~2M).
#11762 has fewer partitions, and is therefore much faster.

For low cardinality (2M rows with only 4 distinct values):
#11762 has 4 partitions with 0.5M rows each.
`main` runs the partial aggregate and gets 4 partitions after repartitioning, with only 1 row to process in the final step.

#11762 is slightly slower than `main` in the low-cardinality case, but the regression is negligible compared to the speedup in the high-cardinality case.

Next, I want to move the repartition inside the single-mode group by. I guess we will then see on-par results for the low-cardinality case.
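The partition sizes quoted above can be checked with a few lines of arithmetic. This is only a back-of-the-envelope sketch, not DataFusion code: the function names are hypothetical, and it assumes rows spread evenly across partitions and that each distinct key lands in its own partition (hash collisions are ignored).

```python
# Back-of-the-envelope model of the partition sizes discussed in the comment.
# All names here are hypothetical; none of this is DataFusion API.

def rows_per_partition(total_rows: int, partitions: int) -> int:
    """Rows each partition handles, assuming a perfectly even split."""
    return total_rows // partitions


def non_empty_partitions(distinct_keys: int, target_partitions: int) -> int:
    """Partitions that receive rows after hash repartitioning on the group key.

    Assumption: every distinct key hashes to a different partition, so the
    count is capped by whichever of the two numbers is smaller.
    """
    return min(distinct_keys, target_partitions)


TOTAL_ROWS = 2_000_000
TARGET_PARTITIONS = 12

# High cardinality: every row is its own group, so all 12 partitions are used.
high_parts = non_empty_partitions(TOTAL_ROWS, TARGET_PARTITIONS)
high_rows = rows_per_partition(TOTAL_ROWS, high_parts)

# Low cardinality: only 4 distinct keys, so at most 4 partitions get rows.
low_parts = non_empty_partitions(4, TARGET_PARTITIONS)
low_rows = rows_per_partition(TOTAL_ROWS, low_parts)

# Partial aggregation on main: 256 partitions of ~8,000 rows each.
main_partial_total = 256 * 8_000

print(f"high cardinality: {high_parts} partitions x {high_rows} rows")
print(f"low cardinality:  {low_parts} partitions x {low_rows} rows")
print(f"main partial step: {main_partial_total} rows in total")
```

Under these assumptions the numbers line up with the comment: about 166,000 rows per partition in the high-cardinality case, 0.5M rows per partition in the low-cardinality case, and roughly 2M rows in total across the 256 partial-aggregate partitions.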