bmmeijers opened a new issue, #7036:
URL: https://github.com/apache/arrow-datafusion/issues/7036

   ### Describe the bug
   
   I am trying to perform a range join on two parquet files with 
datafusion-cli, but it makes an unrealistic query plan.
   
   ```sql
   select * from points, intervals where pt between first and last order by pt;
   ```
   
   I try to express the sort order in both parquet files when registering the 
parquet files with datafusion-cli.
   However, I am somehow stuck with expressing this sort order as the `with 
order (columnname ordering)` clause seems to be unusable together with parquet 
files.
   
   ### To Reproduce
   
   
   The points table has 1 numeric column and the intervals table has two 
columns. With duckdb I generated parquet files for both:
   
   ```sql
   copy (select generate_series as pt from (select * from generate_series(1, 
100000000)) t) TO 'pts.parquet' (FORMAT PARQUET);
   copy (select generate_series as first, first +2 as last from (select * from 
generate_series (2, 100, 5))) TO 'intervals.parquet' (FORMAT PARQUET);
   ```
   
   (which I both ran through parquet-rewrite to add page level statistics).
   
   ```bash
   parquet-rewrite --input pts.parquet --output pts_stats.parquet 
--writer-version 2.0 --statistics-enabled page
   parquet-rewrite --input intervals.parquet --output intervals_stats.parquet 
--writer-version 2.0 --statistics-enabled page
   ```
   
   I would expect that the sort order present in both files could help very 
much to query efficiently (I would pick a sort-merge-join, but without the sort 
of the files being necessary, as the data is already sorted, so just the merge 
part with scanning for the overlapping parts of the table).
   
   Hence, I tried to register the parquet files with datafusion-cli and express 
that the files are already sorted.
   
   This works:
   
   ```sql
   CREATE EXTERNAL TABLE points
   STORED AS PARQUET
   LOCATION 'pts_stats.parquet';
   ```
   
   But this does not: 
   
   CREATE EXTERNAL TABLE points
   STORED AS PARQUET
   LOCATION 'pts_stats.parquet'
   WITH ORDER (pt ASC)
   ;
   
   It errors with: 
   
   ```
   Error during planning: Provide a schema before specifying the order while 
creating a table.
   ```
   
   If I try to specify the schema manually:
   
   ```sql
   CREATE EXTERNAL TABLE points
   (pt bigint)
   STORED AS PARQUET
   
   LOCATION 'pts_stats.parquet'
   WITH ORDER (pt ASC)
   ;
   ```
   
   It errors with:
   
   ```
   Error during planning: Column definitions can not be specified for PARQUET 
files.
   ```
   
   Which is not surprising.
   
   ### Expected behavior
   
   I would expect that the sort order can be expressed and that the query 
planner subsequently can execute a sort-merge-join, where the sort step would 
be unnecessary (as the data of both sides of the join is already sorted).
   
   ### Additional context
   
   _No response_


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to