alamb opened a new issue, #7317:
URL: https://github.com/apache/arrow-datafusion/issues/7317

   ### Is your feature request related to a problem or challenge?
   
   This is a follow on to https://github.com/apache/arrow-datafusion/issues/7036
   
   As @bmmeijers says in https://github.com/apache/arrow-datafusion/issues/7036, DataFusion can make much better plans if you tell it about the sort order of files.
   
   It is now possible to specify the order of a parquet file:
   
   ```sql
   $ datafusion-cli
   DataFusion CLI v29.0.0
   ❯ create external table cpu(time timestamp) stored as parquet location 'cpu.parquet' with order (time desc);
   0 rows in set. Query took 0.001 seconds.
   
   ❯ select * from cpu;
   +---------------------+
   | time                |
   +---------------------+
   | 2022-09-30T12:55:00 |
   +---------------------+
   1 row in set. Query took 0.003 seconds.
   
   ❯ explain select * from cpu order by time desc;
   +---------------+-----------------------------------------------------------------------------------------------------------------------------+
   | plan_type     | plan                                                                                                                        |
   +---------------+-----------------------------------------------------------------------------------------------------------------------------+
   | logical_plan  | Sort: cpu.time DESC NULLS FIRST                                                                                             |
   |               |   TableScan: cpu projection=[time]                                                                                          |
   | physical_plan | ParquetExec: file_groups={1 group: [[Users/alamb/Downloads/cpu.parquet]]}, projection=[time], output_ordering=[time@0 DESC] |
   |               |                                                                                                                             |
   +---------------+-----------------------------------------------------------------------------------------------------------------------------+
   2 rows in set. Query took 0.001 seconds.
   ```
   
   However, it is not possible to specify the order without also specifying the full schema, which is redundant given that the schema is stored in the parquet files:
   
   ```sql
   ❯ create external table cpu stored as parquet location 'cpu.parquet' with order (time desc);
   Error during planning: Provide a schema before specifying the order while creating a table.
   ```
   
   This is the case even though DataFusion can infer the schema automatically:
   
   ```sql
   ❯ create external table cpu stored as parquet location 'cpu.parquet';
   0 rows in set. Query took 0.002 seconds.
   
   ❯ select * from cpu;
   +-----+---------------------+
   | v   | time                |
   +-----+---------------------+
   | 1.0 | 2023-03-01T00:00:00 |
   | 2.0 | 2023-03-02T00:00:00 |
   +-----+---------------------+
   2 rows in set. Query took 0.002 seconds.
   ```
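   
   Until that works, a workaround sketch is to repeat the full schema in the DDL, following the form of the first example. The column types below are assumptions read off the inferred output above (`v` printed as `1.0`, `time` printed as a timestamp); the actual Parquet types are not stated here and may differ:
   
   ```sql
   -- Workaround sketch: declare every column explicitly so that WITH ORDER is accepted.
   -- Assumption: v is a DOUBLE and time is a TIMESTAMP (guessed from the query output).
   create external table cpu(v double, time timestamp)
   stored as parquet
   location 'cpu.parquet'
   with order (time);
   ```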
   
   ### Describe the solution you'd like
   
   I would like to be able to specify the sort order for parquet files without also specifying the schema.
   
   Given this parquet file: 
[cpu.zip](https://github.com/apache/arrow-datafusion/files/12369253/cpu.zip)
   
   I would like this to work and produce a table with both columns `v` and `time`, ordered by `time`:
   
   ```sql
   ❯ create external table cpu stored as parquet location 'cpu.parquet' with order (time);
   Error during planning: Provide a schema before specifying the order while creating a table.
   ```
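   
   If this were supported, one way to confirm that the declared order is picked up would be the same `explain` check used in the first example: the `ParquetExec` line would be expected to report an `output_ordering` on `time`. This is only a sketch of the check, not output from a real run:
   
   ```sql
   -- Hypothetical check, valid only once the create statement above succeeds:
   -- look for output_ordering on the ParquetExec line of the physical plan.
   explain select * from cpu order by time;
   ```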
   
   
   ### Describe alternatives you've considered
   
   _No response_
   
   ### Additional context
   
   _No response_


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

