alamb commented on PR #7192:
URL:
https://github.com/apache/arrow-datafusion/pull/7192#issuecomment-1693769155
I ran some experiments on my `trace` datasets and TLDR is I measured a 10x
performance improvement in both CPU and Memory usage. Very nice results
@avantgardnerio
# CPU / time
| Query | Main | This Branch |
|--------|--------|--------|
| `SELECT trace_id, MAX(time) FROM 'traces_nd.parquet' GROUP BY trace_id
ORDER BY MAX(time) DESC LIMIT 10;` | 0.618s | 0.118s |
| `SELECT trace_id, MAX(time) FROM 'traces_nd.parquet' GROUP BY trace_id
ORDER BY MAX(time) DESC LIMIT 10000;` | 0.756s | 0.262s |
| 20x data`SELECT trace_id, MAX(time) FROM 'traces_nd.parquet' GROUP BY
trace_id ORDER BY MAX(time) DESC LIMIT 10;` | 27.240 | 1.927 |
I roughly ran like this:
```shell
$ datafusion-cli -c "SELECT trace_id, MAX(time) FROM 'traces_nd.parquet'
GROUP BY trace_id ORDER BY MAX(time) DESC LIMIT 10;"
DataFusion CLI v30.0.0
+----------------------------------+-----------------------------+
| trace_id | MAX(traces_nd.parquet.time) |
+----------------------------------+-----------------------------+
| 000000000000000038e297274d174adc | 2023-07-12T18:00:07.810307 |
| 00000000000000000e577b1c8d10a199 | 2023-07-12T18:00:07.801908 |
| 0000000000000000508650f3aa158415 | 2023-07-12T18:00:04.530134 |
| 00000000000000006eb9ebf6a5d98618 | 2023-07-12T18:00:04.512612 |
| 000000000000000001c045ec504c7a90 | 2023-07-12T17:57:29.495888 |
| 00000000000000001a594b08955f5c39 | 2023-07-12T17:57:29.491154 |
| 00000000000000006f6f658ef7e81179 | 2023-07-10T06:16:34.343527 |
| 000000000000000012ad3cd4d96e6583 | 2023-07-10T06:16:34.339103 |
| 00000000000000007b6a40833cb58ae3 | 2023-07-10T06:11:34.361362 |
| 0000000000000000775bc897341290e3 | 2023-07-10T06:11:34.357441 |
+----------------------------------+-----------------------------+
10 rows in set. Query took 0.643 seconds.
```
# Memory:
I measured memory using the `heap` profiler from `Instruments`:
| Query | Main | This Branch |
|--------|--------|--------|
| 20x data `SELECT trace_id, MAX(time) FROM 'traces_nd' GROUP BY trace_id
ORDER BY MAX(time) DESC LIMIT 10;` | 36G | 6GB |
| 20x data `SELECT trace_id, MAX(time) FROM 'traces_nd' GROUP BY trace_id
ORDER BY MAX(time) DESC LIMIT 10000;` | 40.1GB | 7.1GB |
To check this I just scaled the problem up artificially (note I really love
the new `COPY` command that @devinjdangelo has hooked up):
```shell
mkdir traces_nd
for i in `seq 1 20`; do datafusion-cli -c "copy (select trace_id || '_$i' as
trace_id, time from 'traces_nd.parquet') to 'traces_nd/$i.parquet'" ; done
datafusion-cli -c "select count(*) from traces_nd"
DataFusion CLI v30.0.0
+-----------------+
| COUNT(UInt8(1)) |
+-----------------+
| 103714340 |
+-----------------+
1 row in set. Query took 0.009 seconds.
```
And then ran
```shell
datafusion-cli -c "SELECT trace_id, MAX(time) FROM 'traces_nd' GROUP BY
trace_id ORDER BY MAX(time) DESC LIMIT 10;"
```


--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]