alamb commented on PR #11627: URL: https://github.com/apache/datafusion/pull/11627#issuecomment-2254523568
I also tried out Q32 (which has `AVG`, so it can't use this optimization yet), but removed the `AVG` and set target partitions to something silly. I see this PR making a substantial difference (roughly 6.4s vs 7.8s).

### 1000 partitions, this PR

```shell
andrewlamb@Andrews-MacBook-Pro-2:~/Downloads$ ./datafusion-cli-skip-partial -c "set datafusion.execution.target_partitions = 1000; SELECT \"WatchID\", \"ClientIP\", COUNT(*) AS c, SUM(\"IsRefresh\") FROM 'hits.parquet' GROUP BY \"WatchID\", \"ClientIP\" ORDER BY c DESC LIMIT 10;"
Elapsed 0.001 seconds.

+---------------------+-------------+---+-----------------------------+
| WatchID             | ClientIP    | c | sum(hits.parquet.IsRefresh) |
+---------------------+-------------+---+-----------------------------+
| 7904046282518428963 | 1509330109  | 2 | 0                           |
| 8566928176839891583 | -1402644643 | 2 | 0                           |
| 6655575552203051303 | 1611957945  | 2 | 0                           |
| 7224410078130478461 | -776509581  | 2 | 0                           |
| 9102894172721185728 | 1489622498  | 1 | 1                           |
| 8964981845434484863 | 1822336830  | 1 | 0                           |
| 6991883311913569583 | -745122562  | 1 | 0                           |
| 6787783378461221127 | -506600142  | 1 | 0                           |
| 6042898921955304644 | 2054220936  | 1 | 0                           |
| 5581365862985039198 | 104944290   | 1 | 0                           |
+---------------------+-------------+---+-----------------------------+
10 row(s) fetched.
Elapsed 6.378 seconds.
```

### 1000 partitions, main

```shell
andrewlamb@Andrews-MacBook-Pro-2:~/Downloads$ datafusion-cli -c "set datafusion.execution.target_partitions = 1000; SELECT \"WatchID\", \"ClientIP\", COUNT(*) AS c, SUM(\"IsRefresh\") FROM 'hits.parquet' GROUP BY \"WatchID\", \"ClientIP\" ORDER BY c DESC LIMIT 10;"
DataFusion CLI v40.0.0
0 row(s) fetched.
Elapsed 0.002 seconds.
+---------------------+-------------+---+-----------------------------+
| WatchID             | ClientIP    | c | sum(hits.parquet.IsRefresh) |
+---------------------+-------------+---+-----------------------------+
| 7904046282518428963 | 1509330109  | 2 | 0                           |
| 8566928176839891583 | -1402644643 | 2 | 0                           |
| 6655575552203051303 | 1611957945  | 2 | 0                           |
| 7224410078130478461 | -776509581  | 2 | 0                           |
| 6780795588237729988 | 1894276368  | 1 | 1                           |
| 6158430646513894356 | -1557291761 | 1 | 0                           |
| 8433113762047612962 | 1214823432  | 1 | 0                           |
| 8783130976633619349 | 1072197582  | 1 | 0                           |
| 4959259883895284379 | 2023656393  | 1 | 0                           |
| 6328586531975293675 | 1549952556  | 1 | 1                           |
+---------------------+-------------+---+-----------------------------+
10 row(s) fetched.
Elapsed 7.771 seconds.
```
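
For reference, the relative difference between the two elapsed times reported in this comment (6.378 seconds with the PR, 7.771 seconds on main) can be worked out as in the sketch below. The variable names are illustrative only, and each figure comes from a single run, so the result is a rough indication and not a rigorous benchmark.

```shell
# Elapsed times copied from the two runs reported in this comment (seconds).
main_secs=7.771
pr_secs=6.378

# awk does the floating-point arithmetic that plain POSIX sh lacks.
summary=$(awk -v main="$main_secs" -v pr="$pr_secs" 'BEGIN {
    # speedup: how many times faster the PR build ran
    # reduction: share of the main-branch wall time that was saved
    printf "speedup=%.2fx reduction=%.1f%%", main / pr, (main - pr) / main * 100
}')
echo "$summary"
```

With these inputs the script prints `speedup=1.22x reduction=17.9%`.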