alamb commented on PR #7192:
URL: 
https://github.com/apache/arrow-datafusion/pull/7192#issuecomment-1692062455

   Hi @avantgardnerio  -- I have been playing with this branch but I am not 
sure it is running on my test case.
   
   Here is what I have been trying. 
   
   Download traces_nd.zip (200MB)
   ```sql
   ❯ describe 'traces_nd.parquet';
   +--------------------+-----------------------------+-------------+
   | column_name        | data_type                   | is_nullable |
   +--------------------+-----------------------------+-------------+
   | attributes         | Utf8                        | YES         |
   | duration_nano      | Int64                       | YES         |
   | end_time_unix_nano | Int64                       | YES         |
   | service.name       | Utf8                        | YES         |
   | span.kind          | Utf8                        | YES         |
   | span.name          | Utf8                        | YES         |
   | span_id            | Utf8                        | YES         |
   | time               | Timestamp(Nanosecond, None) | NO          |
   | trace_id           | Utf8                        | YES         |
   | otel.status_code   | Utf8                        | YES         |
   | parent_span_id     | Utf8                        | YES         |
   +--------------------+-----------------------------+-------------+
   11 rows in set. Query took 0.172 seconds.
   ```
   
   Then I ran my query using `datafusion-cli` 
   
   ```sql
   ❯ SELECT trace_id, MAX(time) FROM 'traces_nd.parquet' GROUP BY trace_id 
ORDER BY MAX(time) DESC LIMIT 10;
   +----------------------------------+-----------------------------+
   | trace_id                         | MAX(traces_nd.parquet.time) |
   +----------------------------------+-----------------------------+
   | 000000000000000038e297274d174adc | 2023-07-12T18:00:07.810307  |
   | 00000000000000000e577b1c8d10a199 | 2023-07-12T18:00:07.801908  |
   | 0000000000000000508650f3aa158415 | 2023-07-12T18:00:04.530134  |
   | 00000000000000006eb9ebf6a5d98618 | 2023-07-12T18:00:04.512612  |
   | 000000000000000001c045ec504c7a90 | 2023-07-12T17:57:29.495888  |
   | 00000000000000001a594b08955f5c39 | 2023-07-12T17:57:29.491154  |
   | 00000000000000006f6f658ef7e81179 | 2023-07-10T06:16:34.343527  |
   | 000000000000000012ad3cd4d96e6583 | 2023-07-10T06:16:34.339103  |
   | 00000000000000007b6a40833cb58ae3 | 2023-07-10T06:11:34.361362  |
   | 0000000000000000775bc897341290e3 | 2023-07-10T06:11:34.357441  |
   +----------------------------------+-----------------------------+
   10 rows in set. Query took 1.176 seconds.
   ```
   
   However, the `explain` doesn't indicate that that the `limit` was pushed to 
the group by operator:
   
   ```sql
   ❯ explain SELECT trace_id, MAX(time) FROM 'traces_nd.parquet' GROUP BY 
trace_id ORDER BY MAX(time) DESC LIMIT 10;
   
+---------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
   | plan_type     | plan                                                       
                                                                                
                                                                                
                                                                                
                                                                         |
   
+---------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
   | logical_plan  | Limit: skip=0, fetch=10                                    
                                                                                
                                                                                
                                                                                
                                                                         |
   |               |   Sort: MAX(traces_nd.parquet.time) DESC NULLS FIRST, 
fetch=10                                                                        
                                                                                
                                                                                
                                                                              |
   |               |     Aggregate: groupBy=[[traces_nd.parquet.trace_id]], 
aggr=[[MAX(traces_nd.parquet.time)]]                                            
                                                                                
                                                                                
                                                                             |
   |               |       TableScan: traces_nd.parquet projection=[time, 
trace_id]                                                                       
                                                                                
                                                                                
                                                                               |
   | physical_plan | GlobalLimitExec: skip=0, fetch=10                          
                                                                                
                                                                                
                                                                                
                                                                         |
   |               |   SortPreservingMergeExec: [MAX(traces_nd.parquet.time)@1 
DESC], fetch=10                                                                 
                                                                                
                                                                                
                                                                          |
   |               |     SortExec: fetch=10, 
expr=[MAX(traces_nd.parquet.time)@1 DESC]                                       
                                                                                
                                                                                
                                                                                
                            |
   |               |       AggregateExec: mode=FinalPartitioned, 
gby=[trace_id@0 as trace_id], aggr=[MAX(traces_nd.parquet.time)]                
                                                                                
                                                                                
                                                                                
        |
   |               |         CoalesceBatchesExec: target_batch_size=8192        
                                                                                
                                                                                
                                                                                
                                                                         |
   |               |           RepartitionExec: partitioning=Hash([trace_id@0], 
8), input_partitions=8                                                          
                                                                                
                                                                                
                                                                         |
   |               |             AggregateExec: mode=Partial, gby=[trace_id@1 
as trace_id], aggr=[MAX(traces_nd.parquet.time)]                                
                                                                                
                                                                                
                                                                           |
   |               |               ParquetExec: file_groups={8 groups: 
[[home/alamb/traces/traces_nd.parquet:0..26131427], 
[home/alamb/traces/traces_nd.parquet:26131427..52262854], 
[home/alamb/traces/traces_nd.parquet:52262854..78394281], 
[home/alamb/traces/traces_nd.parquet:78394281..104525708], 
[home/alamb/traces/traces_nd.parquet:104525708..130657135], ...]}, 
projection=[time, trace_id] |
   |               |                                                            
                                                                                
                                                                                
                                                                                
                                                                         |
   
+---------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
   2 rows in set. Query took 0.004 seconds.
   ```
   
   Am I doing something wrong here?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to