Re: [PR] rfc: optional skipping partial aggregation [datafusion]

via GitHub Tue, 30 Jul 2024 14:02:35 -0700


alamb commented on PR #11627:
URL: https://github.com/apache/datafusion/pull/11627#issuecomment-2259204056


   I have been playing with this PR more. On my 8-core test machine on GCP, I am running:
   
   ```sql
   set datafusion.execution.target_partitions = 90;
   SELECT "WatchID", "ClientIP", COUNT(*) AS c, SUM("IsRefresh") FROM hits GROUP BY "WatchID", "ClientIP" ORDER BY c DESC LIMIT 10;
   ```
   
   The actual command:
   ```shell
   ./datafusion-cli -c 'set datafusion.execution.target_partitions = 90; SELECT "WatchID", "ClientIP", COUNT(*) AS c, SUM("IsRefresh") FROM hits GROUP BY "WatchID", "ClientIP" ORDER BY c DESC LIMIT 10;'
   ```
   
   On this branch, I reliably see it use 8GB peak memory and take around 10 seconds:
   
   8GB max
   10 row(s) fetched.
   Elapsed 10.073 seconds.
   Elapsed 9.880 seconds.
   Elapsed 9.939 seconds.
   
   When running the same command on main, I reliably see it use 12GB of peak memory and take around 14 seconds:
   
   12GB peak
   Elapsed 14.069 seconds.
   Elapsed 14.018 seconds.
   Elapsed 14.078 seconds.
   
   Therefore I conclude (again) that this branch is a substantial improvement for high-cardinality aggregates on many cores, and I think we should merge it.
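
To put rough numbers on the comparison, the reported timings and peak-memory figures can be summarized as below. This is a minimal sketch that only re-uses the values quoted above. The variable names are illustrative, and the memory figures are the rounded 8GB and 12GB values, so the percentages are approximate.

```python
from statistics import mean

# Elapsed times in seconds, copied from the three runs reported for each build.
branch_secs = [10.073, 9.880, 9.939]
main_secs = [14.069, 14.018, 14.078]

# Peak memory in GB as reported (rounded figures, so treat results as approximate).
branch_peak_gb = 8
main_peak_gb = 12

branch_mean = mean(branch_secs)
main_mean = mean(main_secs)

# Relative improvement of the branch over main.
speedup = main_mean / branch_mean
time_saved = 1 - branch_mean / main_mean
memory_saved = 1 - branch_peak_gb / main_peak_gb

print(f"mean elapsed: branch {branch_mean:.2f}s, main {main_mean:.2f}s")
print(f"speedup: {speedup:.2f}x ({time_saved:.0%} less time)")
print(f"peak memory: {memory_saved:.0%} lower")
```

With these inputs the branch comes out about 1.4x faster, with roughly a third less peak memory. Three runs per build on a single machine is a small sample, so this is an indication and not a full benchmark.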


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


