comphead commented on PR #6065:
URL: 
https://github.com/apache/arrow-datafusion/pull/6065#issuecomment-1520534020

   I have also double checked from my side with latest main which also includes 
row_hash optimizations from @mingmwang 
   Testing is against 13G `hits.parquet` file, 2 runs(cold run, warm run)
   
   datafusion-cli built in release mode
   Machine 2.4 GHz 8-Core Intel Core i9 MacOS
   
   With this PR
   ```
   DataFusion CLI v23.0.0
   ❯ CREATE EXTERNAL TABLE hits STORED AS PARQUET location 'hits.parquet';
   0 rows in set. Query took 0.079 seconds.
   ❯ SELECT "WatchID", "ClientIP", COUNT(*) AS c, SUM("IsRefresh"), 
AVG("ResolutionWidth") FROM hits GROUP BY "WatchID", "ClientIP" ORDER BY c DESC 
LIMIT 10;
   
+---------------------+-------------+---+---------------------+---------------------------+
   | WatchID             | ClientIP    | c | SUM(hits.IsRefresh) | 
AVG(hits.ResolutionWidth) |
   
+---------------------+-------------+---+---------------------+---------------------------+
   | 6655575552203051303 | 1611957945  | 2 | 0                   | 1638.0       
             |
   | 7904046282518428963 | 1509330109  | 2 | 0                   | 1368.0       
             |
   | 8566928176839891583 | -1402644643 | 2 | 0                   | 1368.0       
             |
   | 7224410078130478461 | -776509581  | 2 | 0                   | 1368.0       
             |
   | 7968574085024155935 | -986722817  | 1 | 0                   | 1990.0       
             |
   | 8683116696854507598 | -1887352109 | 1 | 0                   | 1917.0       
             |
   | 6071982018954122379 | 1154898388  | 1 | 0                   | 1638.0       
             |
   | 7044330683984323480 | -765736418  | 1 | 0                   | 1750.0       
             |
   | 5170668904757974782 | 580435115   | 1 | 0                   | 1087.0       
             |
   | 7121372218861667575 | -888761092  | 1 | 0                   | 1368.0       
             |
   
+---------------------+-------------+---+---------------------+---------------------------+
   10 rows in set. Query took 51.300 seconds.
   ❯ SELECT "WatchID", "ClientIP", COUNT(*) AS c, SUM("IsRefresh"), 
AVG("ResolutionWidth") FROM hits GROUP BY "WatchID", "ClientIP" ORDER BY c DESC 
LIMIT 10;
   
+---------------------+-------------+---+---------------------+---------------------------+
   | WatchID             | ClientIP    | c | SUM(hits.IsRefresh) | 
AVG(hits.ResolutionWidth) |
   
+---------------------+-------------+---+---------------------+---------------------------+
   | 6655575552203051303 | 1611957945  | 2 | 0                   | 1638.0       
             |
   | 7904046282518428963 | 1509330109  | 2 | 0                   | 1368.0       
             |
   | 8566928176839891583 | -1402644643 | 2 | 0                   | 1368.0       
             |
   | 7224410078130478461 | -776509581  | 2 | 0                   | 1368.0       
             |
   | 5811527790243312578 | 56896075    | 1 | 0                   | 1368.0       
             |
   | 5998597912391672099 | 807012274   | 1 | 0                   | 1996.0       
             |
   | 6071982018954122379 | 1154898388  | 1 | 0                   | 1638.0       
             |
   | 7044330683984323480 | -765736418  | 1 | 0                   | 1750.0       
             |
   | 5170668904757974782 | 580435115   | 1 | 0                   | 1087.0       
             |
   | 7121372218861667575 | -888761092  | 1 | 0                   | 1368.0       
             |
   
+---------------------+-------------+---+---------------------+---------------------------+
   10 rows in set. Query took 47.066 seconds.
   ```
   
   Without PR
   ```
   DataFusion CLI v23.0.0
   ❯ CREATE EXTERNAL TABLE hits STORED AS PARQUET location 'hits.parquet';
   0 rows in set. Query took 0.066 seconds.
   ❯ SELECT "WatchID", "ClientIP", COUNT(*) AS c, SUM("IsRefresh"), 
AVG("ResolutionWidth") FROM hits GROUP BY "WatchID", "ClientIP" ORDER BY c DESC 
LIMIT 10;
   
+---------------------+-------------+---+---------------------+---------------------------+
   | WatchID             | ClientIP    | c | SUM(hits.IsRefresh) | 
AVG(hits.ResolutionWidth) |
   
+---------------------+-------------+---+---------------------+---------------------------+
   | 6655575552203051303 | 1611957945  | 2 | 0                   | 1638.0       
             |
   | 7904046282518428963 | 1509330109  | 2 | 0                   | 1368.0       
             |
   | 8566928176839891583 | -1402644643 | 2 | 0                   | 1368.0       
             |
   | 7224410078130478461 | -776509581  | 2 | 0                   | 1368.0       
             |
   | 7876723297163966966 | -1495669395 | 1 | 0                   | 1750.0       
             |
   | 9012818526311489736 | 1663619136  | 1 | 0                   | 1638.0       
             |
   | 6071982018954122379 | 1154898388  | 1 | 0                   | 1638.0       
             |
   | 7044330683984323480 | -765736418  | 1 | 0                   | 1750.0       
             |
   | 5170668904757974782 | 580435115   | 1 | 0                   | 1087.0       
             |
   | 7121372218861667575 | -888761092  | 1 | 0                   | 1368.0       
             |
   
+---------------------+-------------+---+---------------------+---------------------------+
   10 rows in set. Query took 50.776 seconds.
   ❯ SELECT "WatchID", "ClientIP", COUNT(*) AS c, SUM("IsRefresh"), 
AVG("ResolutionWidth") FROM hits GROUP BY "WatchID", "ClientIP" ORDER BY c DESC 
LIMIT 10;
   
+---------------------+-------------+---+---------------------+---------------------------+
   | WatchID             | ClientIP    | c | SUM(hits.IsRefresh) | 
AVG(hits.ResolutionWidth) |
   
+---------------------+-------------+---+---------------------+---------------------------+
   | 6655575552203051303 | 1611957945  | 2 | 0                   | 1638.0       
             |
   | 7904046282518428963 | 1509330109  | 2 | 0                   | 1368.0       
             |
   | 8566928176839891583 | -1402644643 | 2 | 0                   | 1368.0       
             |
   | 7224410078130478461 | -776509581  | 2 | 0                   | 1368.0       
             |
   | 5981193970486754997 | -1725568709 | 1 | 0                   | 1917.0       
             |
   | 6089584153122015492 | 1209163516  | 1 | 0                   | 1996.0       
             |
   | 6071982018954122379 | 1154898388  | 1 | 0                   | 1638.0       
             |
   | 7044330683984323480 | -765736418  | 1 | 0                   | 1750.0       
             |
   | 5170668904757974782 | 580435115   | 1 | 0                   | 1087.0       
             |
   | 7121372218861667575 | -888761092  | 1 | 0                   | 1368.0       
             |
   
+---------------------+-------------+---+---------------------+---------------------------+
   10 rows in set. Query took 48.886 seconds.
   ```
   
   Before #6003 DF took 62 and 58 sec respectively. 
   That mean @mingmwang fixes performance and `slice_and_maybe_filter` was not 
a bottleneck anymore so this PR doesn't bring any benefit. I will close it. 
Thanks all for participating


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to