[GitHub] [arrow-datafusion] alamb commented on pull request #7192: Create a Priority Queue based Aggregation with `limit`

via GitHub Fri, 25 Aug 2023 11:26:01 -0700


alamb commented on PR #7192:
URL: 
https://github.com/apache/arrow-datafusion/pull/7192#issuecomment-1693769155


   I ran some experiments on my `trace` datasets and TLDR is I measured a 10x 
performance improvement in both CPU and Memory usage. Very nice results 
@avantgardnerio 
   
   # CPU / time 
   
   | Query | Main | This Branch |
   |--------|--------|--------|
   | `SELECT trace_id, MAX(time) FROM 'traces_nd.parquet' GROUP BY trace_id 
ORDER BY MAX(time) DESC LIMIT 10;` | 0.618s | 0.118s |
   | `SELECT trace_id, MAX(time) FROM 'traces_nd.parquet' GROUP BY trace_id 
ORDER BY MAX(time) DESC LIMIT 10000;` | 0.756s | 0.262s |
   | 20x data`SELECT trace_id, MAX(time) FROM 'traces_nd.parquet' GROUP BY 
trace_id ORDER BY MAX(time) DESC LIMIT 10;` | 27.240 | 1.927 |
   
   
   I roughly ran like this:
   
   ```shell
   $ datafusion-cli -c "SELECT trace_id, MAX(time) FROM 'traces_nd.parquet' 
GROUP BY trace_id ORDER BY MAX(time) DESC LIMIT 10;"
   DataFusion CLI v30.0.0
   +----------------------------------+-----------------------------+
   | trace_id                         | MAX(traces_nd.parquet.time) |
   +----------------------------------+-----------------------------+
   | 000000000000000038e297274d174adc | 2023-07-12T18:00:07.810307  |
   | 00000000000000000e577b1c8d10a199 | 2023-07-12T18:00:07.801908  |
   | 0000000000000000508650f3aa158415 | 2023-07-12T18:00:04.530134  |
   | 00000000000000006eb9ebf6a5d98618 | 2023-07-12T18:00:04.512612  |
   | 000000000000000001c045ec504c7a90 | 2023-07-12T17:57:29.495888  |
   | 00000000000000001a594b08955f5c39 | 2023-07-12T17:57:29.491154  |
   | 00000000000000006f6f658ef7e81179 | 2023-07-10T06:16:34.343527  |
   | 000000000000000012ad3cd4d96e6583 | 2023-07-10T06:16:34.339103  |
   | 00000000000000007b6a40833cb58ae3 | 2023-07-10T06:11:34.361362  |
   | 0000000000000000775bc897341290e3 | 2023-07-10T06:11:34.357441  |
   +----------------------------------+-----------------------------+
   10 rows in set. Query took 0.643 seconds.
   ```
   
   # Memory:
   I measured memory using the `heap` profiler from `Instruments`:
   
   | Query | Main | This Branch |
   |--------|--------|--------|
   | 20x data `SELECT trace_id, MAX(time) FROM 'traces_nd' GROUP BY trace_id 
ORDER BY MAX(time) DESC LIMIT 10;` | 36G | 6GB |
   | 20x data `SELECT trace_id, MAX(time) FROM 'traces_nd' GROUP BY trace_id 
ORDER BY MAX(time) DESC LIMIT 10000;` | 40.1GB | 7.1GB |
   
   
   
   To check this I just scaled the problem up artificially (note I really love 
the new `COPY` command that @devinjdangelo  has hooked up): 
    
   ```shell
   mkdir traces_nd
   for i in `seq 1 20`; do datafusion-cli -c "copy (select trace_id || '_$i' as 
trace_id, time from 'traces_nd.parquet') to 'traces_nd/$i.parquet'" ; done
   
   datafusion-cli -c "select count(*) from traces_nd"
   DataFusion CLI v30.0.0
   +-----------------+
   | COUNT(UInt8(1)) |
   +-----------------+
   | 103714340       |
   +-----------------+
   1 row in set. Query took 0.009 seconds.
   
   ```
   
   And then ran
   
   ```shell
   datafusion-cli -c "SELECT trace_id, MAX(time) FROM 'traces_nd' GROUP BY 
trace_id ORDER BY MAX(time) DESC LIMIT 10;"
   ```
   
   ![Screenshot 2023-08-25 at 2 19 30 
PM](https://github.com/apache/arrow-datafusion/assets/490673/996e88c1-8f27-45fd-8660-cbb8de91a08d)
   ![Screenshot 2023-08-25 at 2 21 52 
PM](https://github.com/apache/arrow-datafusion/assets/490673/6de14d52-f4b8-4f71-a881-1b653e98a76a)
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

[GitHub] [arrow-datafusion] alamb commented on pull request #7192: Create a Priority Queue based Aggregation with `limit`

Reply via email to