comphead commented on PR #6065:
URL:
https://github.com/apache/arrow-datafusion/pull/6065#issuecomment-1520534020
I have also double checked from my side with latest main which also includes
row_hash optimizations from @mingmwang
Testing is against 13G `hits.parquet` file, 2 runs(cold run, warm run)
datafusion-cli built in release mode
Machine 2.4 GHz 8-Core Intel Core i9 MacOS
With this PR
```
DataFusion CLI v23.0.0
❯ CREATE EXTERNAL TABLE hits STORED AS PARQUET location 'hits.parquet';
0 rows in set. Query took 0.079 seconds.
❯ SELECT "WatchID", "ClientIP", COUNT(*) AS c, SUM("IsRefresh"),
AVG("ResolutionWidth") FROM hits GROUP BY "WatchID", "ClientIP" ORDER BY c DESC
LIMIT 10;
+---------------------+-------------+---+---------------------+---------------------------+
| WatchID | ClientIP | c | SUM(hits.IsRefresh) |
AVG(hits.ResolutionWidth) |
+---------------------+-------------+---+---------------------+---------------------------+
| 6655575552203051303 | 1611957945 | 2 | 0 | 1638.0
|
| 7904046282518428963 | 1509330109 | 2 | 0 | 1368.0
|
| 8566928176839891583 | -1402644643 | 2 | 0 | 1368.0
|
| 7224410078130478461 | -776509581 | 2 | 0 | 1368.0
|
| 7968574085024155935 | -986722817 | 1 | 0 | 1990.0
|
| 8683116696854507598 | -1887352109 | 1 | 0 | 1917.0
|
| 6071982018954122379 | 1154898388 | 1 | 0 | 1638.0
|
| 7044330683984323480 | -765736418 | 1 | 0 | 1750.0
|
| 5170668904757974782 | 580435115 | 1 | 0 | 1087.0
|
| 7121372218861667575 | -888761092 | 1 | 0 | 1368.0
|
+---------------------+-------------+---+---------------------+---------------------------+
10 rows in set. Query took 51.300 seconds.
❯ SELECT "WatchID", "ClientIP", COUNT(*) AS c, SUM("IsRefresh"),
AVG("ResolutionWidth") FROM hits GROUP BY "WatchID", "ClientIP" ORDER BY c DESC
LIMIT 10;
+---------------------+-------------+---+---------------------+---------------------------+
| WatchID | ClientIP | c | SUM(hits.IsRefresh) |
AVG(hits.ResolutionWidth) |
+---------------------+-------------+---+---------------------+---------------------------+
| 6655575552203051303 | 1611957945 | 2 | 0 | 1638.0
|
| 7904046282518428963 | 1509330109 | 2 | 0 | 1368.0
|
| 8566928176839891583 | -1402644643 | 2 | 0 | 1368.0
|
| 7224410078130478461 | -776509581 | 2 | 0 | 1368.0
|
| 5811527790243312578 | 56896075 | 1 | 0 | 1368.0
|
| 5998597912391672099 | 807012274 | 1 | 0 | 1996.0
|
| 6071982018954122379 | 1154898388 | 1 | 0 | 1638.0
|
| 7044330683984323480 | -765736418 | 1 | 0 | 1750.0
|
| 5170668904757974782 | 580435115 | 1 | 0 | 1087.0
|
| 7121372218861667575 | -888761092 | 1 | 0 | 1368.0
|
+---------------------+-------------+---+---------------------+---------------------------+
10 rows in set. Query took 47.066 seconds.
```
Without PR
```
DataFusion CLI v23.0.0
❯ CREATE EXTERNAL TABLE hits STORED AS PARQUET location 'hits.parquet';
0 rows in set. Query took 0.066 seconds.
❯ SELECT "WatchID", "ClientIP", COUNT(*) AS c, SUM("IsRefresh"),
AVG("ResolutionWidth") FROM hits GROUP BY "WatchID", "ClientIP" ORDER BY c DESC
LIMIT 10;
+---------------------+-------------+---+---------------------+---------------------------+
| WatchID | ClientIP | c | SUM(hits.IsRefresh) |
AVG(hits.ResolutionWidth) |
+---------------------+-------------+---+---------------------+---------------------------+
| 6655575552203051303 | 1611957945 | 2 | 0 | 1638.0
|
| 7904046282518428963 | 1509330109 | 2 | 0 | 1368.0
|
| 8566928176839891583 | -1402644643 | 2 | 0 | 1368.0
|
| 7224410078130478461 | -776509581 | 2 | 0 | 1368.0
|
| 7876723297163966966 | -1495669395 | 1 | 0 | 1750.0
|
| 9012818526311489736 | 1663619136 | 1 | 0 | 1638.0
|
| 6071982018954122379 | 1154898388 | 1 | 0 | 1638.0
|
| 7044330683984323480 | -765736418 | 1 | 0 | 1750.0
|
| 5170668904757974782 | 580435115 | 1 | 0 | 1087.0
|
| 7121372218861667575 | -888761092 | 1 | 0 | 1368.0
|
+---------------------+-------------+---+---------------------+---------------------------+
10 rows in set. Query took 50.776 seconds.
❯ SELECT "WatchID", "ClientIP", COUNT(*) AS c, SUM("IsRefresh"),
AVG("ResolutionWidth") FROM hits GROUP BY "WatchID", "ClientIP" ORDER BY c DESC
LIMIT 10;
+---------------------+-------------+---+---------------------+---------------------------+
| WatchID | ClientIP | c | SUM(hits.IsRefresh) |
AVG(hits.ResolutionWidth) |
+---------------------+-------------+---+---------------------+---------------------------+
| 6655575552203051303 | 1611957945 | 2 | 0 | 1638.0
|
| 7904046282518428963 | 1509330109 | 2 | 0 | 1368.0
|
| 8566928176839891583 | -1402644643 | 2 | 0 | 1368.0
|
| 7224410078130478461 | -776509581 | 2 | 0 | 1368.0
|
| 5981193970486754997 | -1725568709 | 1 | 0 | 1917.0
|
| 6089584153122015492 | 1209163516 | 1 | 0 | 1996.0
|
| 6071982018954122379 | 1154898388 | 1 | 0 | 1638.0
|
| 7044330683984323480 | -765736418 | 1 | 0 | 1750.0
|
| 5170668904757974782 | 580435115 | 1 | 0 | 1087.0
|
| 7121372218861667575 | -888761092 | 1 | 0 | 1368.0
|
+---------------------+-------------+---+---------------------+---------------------------+
10 rows in set. Query took 48.886 seconds.
```
Before #6003 DF took 62 and 58 sec respectively.
That mean @mingmwang fixes performance and `slice_and_maybe_filter` was not
a bottleneck anymore so this PR doesn't bring any benefit. I will close it.
Thanks all for participating
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]