alamb opened a new issue, #7317: URL: https://github.com/apache/arrow-datafusion/issues/7317
### Is your feature request related to a problem or challenge? This is a follow on to https://github.com/apache/arrow-datafusion/issues/7036 As @bmmeijers says in https://github.com/apache/arrow-datafusion/issues/7036, datafusion can make much better plans if you tell it about the sort order of files. It is possible now to specify the order of a parquet file ```sql $ datafusion-cli DataFusion CLI v29.0.0 ❯ create external table cpu(time timestamp) stored as parquet location 'cpu.parquet' with order (time desc); 0 rows in set. Query took 0.001 seconds. ❯ select * from cpu; +---------------------+ | time | +---------------------+ | 2022-09-30T12:55:00 | +---------------------+ 1 row in set. Query took 0.003 seconds. ❯ explain select * from cpu order by time desc; +---------------+-----------------------------------------------------------------------------------------------------------------------------+ | plan_type | plan | +---------------+-----------------------------------------------------------------------------------------------------------------------------+ | logical_plan | Sort: cpu.time DESC NULLS FIRST | | | TableScan: cpu projection=[time] | | physical_plan | ParquetExec: file_groups={1 group: [[Users/alamb/Downloads/cpu.parquet]]}, projection=[time], output_ordering=[time@0 DESC] | | | | +---------------+-----------------------------------------------------------------------------------------------------------------------------+ 2 rows in set. Query took 0.001 seconds. ``` However, it is not possible to specify the time without also specifying all of the schema, which is redundant given the schema is stored in the parquet files: ```sql ❯ create external table cpu stored as parquet location 'cpu.parquet' with order (time desc); Error during planning: Provide a schema before specifying the order while creating a table. ``` Even though DataFusion can infer the schema automatically ```sql ❯ create external table cpu stored as parquet location 'cpu.parquet'; 0 rows in set. Query took 0.002 seconds. ❯ select * from cpu; +-----+---------------------+ | v | time | +-----+---------------------+ | 1.0 | 2023-03-01T00:00:00 | | 2.0 | 2023-03-02T00:00:00 | +-----+---------------------+ 2 rows in set. Query took 0.002 seconds. ``` ### Describe the solution you'd like I would like to be able to specify the sort order for parquet files without also specifying the schema Given this parquet file: [cpu.zip](https://github.com/apache/arrow-datafusion/files/12369253/cpu.zip) I would like this to work and produce a table both columns `v` and `time` ordered by `time`: ```sql ❯ create external table cpu stored as parquet location 'cpu.parquet' with order (time); Error during planning: Provide a schema before specifying the order while creating a table. ``` ### Describe alternatives you've considered _No response_ ### Additional context _No response_ -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
