alamb commented on PR #7192: URL: https://github.com/apache/arrow-datafusion/pull/7192#issuecomment-1692062455
Hi @avantgardnerio -- I have been playing with this branch but I am not
sure it is running on my test case.
Here is what I have been trying.
Download traces_nd.zip (200MB)
```sql
❯ describe 'traces_nd.parquet';
+--------------------+-----------------------------+-------------+
| column_name | data_type | is_nullable |
+--------------------+-----------------------------+-------------+
| attributes | Utf8 | YES |
| duration_nano | Int64 | YES |
| end_time_unix_nano | Int64 | YES |
| service.name | Utf8 | YES |
| span.kind | Utf8 | YES |
| span.name | Utf8 | YES |
| span_id | Utf8 | YES |
| time | Timestamp(Nanosecond, None) | NO |
| trace_id | Utf8 | YES |
| otel.status_code | Utf8 | YES |
| parent_span_id | Utf8 | YES |
+--------------------+-----------------------------+-------------+
11 rows in set. Query took 0.172 seconds.
```
Then I ran my query using `datafusion-cli`
```sql
❯ SELECT trace_id, MAX(time) FROM 'traces_nd.parquet' GROUP BY trace_id
ORDER BY MAX(time) DESC LIMIT 10;
+----------------------------------+-----------------------------+
| trace_id | MAX(traces_nd.parquet.time) |
+----------------------------------+-----------------------------+
| 000000000000000038e297274d174adc | 2023-07-12T18:00:07.810307 |
| 00000000000000000e577b1c8d10a199 | 2023-07-12T18:00:07.801908 |
| 0000000000000000508650f3aa158415 | 2023-07-12T18:00:04.530134 |
| 00000000000000006eb9ebf6a5d98618 | 2023-07-12T18:00:04.512612 |
| 000000000000000001c045ec504c7a90 | 2023-07-12T17:57:29.495888 |
| 00000000000000001a594b08955f5c39 | 2023-07-12T17:57:29.491154 |
| 00000000000000006f6f658ef7e81179 | 2023-07-10T06:16:34.343527 |
| 000000000000000012ad3cd4d96e6583 | 2023-07-10T06:16:34.339103 |
| 00000000000000007b6a40833cb58ae3 | 2023-07-10T06:11:34.361362 |
| 0000000000000000775bc897341290e3 | 2023-07-10T06:11:34.357441 |
+----------------------------------+-----------------------------+
10 rows in set. Query took 1.176 seconds.
```
However, the `explain` doesn't indicate that that the `limit` was pushed to
the group by operator:
```sql
❯ explain SELECT trace_id, MAX(time) FROM 'traces_nd.parquet' GROUP BY
trace_id ORDER BY MAX(time) DESC LIMIT 10;
+---------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| plan_type | plan
|
+---------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| logical_plan | Limit: skip=0, fetch=10
|
| | Sort: MAX(traces_nd.parquet.time) DESC NULLS FIRST,
fetch=10
|
| | Aggregate: groupBy=[[traces_nd.parquet.trace_id]],
aggr=[[MAX(traces_nd.parquet.time)]]
|
| | TableScan: traces_nd.parquet projection=[time,
trace_id]
|
| physical_plan | GlobalLimitExec: skip=0, fetch=10
|
| | SortPreservingMergeExec: [MAX(traces_nd.parquet.time)@1
DESC], fetch=10
|
| | SortExec: fetch=10,
expr=[MAX(traces_nd.parquet.time)@1 DESC]
|
| | AggregateExec: mode=FinalPartitioned,
gby=[trace_id@0 as trace_id], aggr=[MAX(traces_nd.parquet.time)]
|
| | CoalesceBatchesExec: target_batch_size=8192
|
| | RepartitionExec: partitioning=Hash([trace_id@0],
8), input_partitions=8
|
| | AggregateExec: mode=Partial, gby=[trace_id@1
as trace_id], aggr=[MAX(traces_nd.parquet.time)]
|
| | ParquetExec: file_groups={8 groups:
[[home/alamb/traces/traces_nd.parquet:0..26131427],
[home/alamb/traces/traces_nd.parquet:26131427..52262854],
[home/alamb/traces/traces_nd.parquet:52262854..78394281],
[home/alamb/traces/traces_nd.parquet:78394281..104525708],
[home/alamb/traces/traces_nd.parquet:104525708..130657135], ...]},
projection=[time, trace_id] |
| |
|
+---------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
2 rows in set. Query took 0.004 seconds.
```
Am I doing something wrong here?
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
