Re: [I] Improve performance of high cardinality grouping by reusing hash values [datafusion]

via GitHub Thu, 01 Aug 2024 23:48:16 -0700


jayzhan211 commented on issue #11680:
URL: https://github.com/apache/datafusion/issues/11680#issuecomment-2264680940


   Difference between #11762 and main
   
   #11762 runs Repartition -> SingleMode group by
   `main` runs Partial group by-> Repartition -> Final group by
   
   For high cardinality (2M rows unique values)
   #11762 repartitioned to 12, divided equally, each has around ~166000 rows to 
do single group by (same with final step)
   `main` run partial aggregate first, there are 256 partition with ~8000 rows 
each. 256 * 8000 is ~2M.
   
   #11762 has less partition, and thus speed up a lot
   
   For low cardinality (2M rows with only 4 values)
   #11762 has 4 partitions, each 0.5M rows
   `main` run partial aggregate and got 4 partition after repartition, only 1 
row to process in final step
   
   I can see #11762 is slightly slower than `main` for low cardinality case but 
the regression is negligible compare to high cardinality case.
   
   
   Next, I want to move repartition within single mode group by. I guess we can 
see on par result for low cardinality case
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: [I] Improve performance of high cardinality grouping by reusing hash values [datafusion]

Reply via email to