alamb opened a new issue, #7354:
URL: https://github.com/apache/arrow-datafusion/issues/7354

   ### Is your feature request related to a problem or challenge?
   
   As of now, you can 
   1. create an external table (implemented by `ListingTable`) that points at a 
local directory and insert data into it, which creates new files
   2. create an external table (implemented by `ListingTable`) that points at a 
local directory with a declared sort order, and DataFusion will take advantage 
of that order!
   
   Sadly, you cannot do both together: inserting data into an external table 
that has a declared sort order fails. For example:
   
   
   ```shell
   $ mkdir output
   $ datafusion-cli
   ```
   
   ```sql
   DataFusion CLI v29.0.0
   ❯ create external table output(time timestamp) stored as parquet location 
'output' with order (time);
   0 rows in set. Query took 0.002 seconds.
   
   ❯ insert into output values (now());
   This feature is not implemented: Writing to a sorted listing table via 
insert into is not supported yet. To write to this table in the meantime, 
register an equivalent table with file_sort_order = vec![]
   ```
   
   
   ### Describe the solution you'd like
   
   From @devinjdangelo's comments in 
https://github.com/apache/arrow-datafusion/issues/6569#issuecomment-1683790637
   
   In the case of appending new files to a directory, I think it is as simple 
as having FileSinkExec require its input be sorted. DataFusion's optimizer 
should do the rest to ensure the new file is sorted properly. 
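   
   A toy sketch of that idea, not DataFusion's actual API: the `Plan` enum and 
`enforce_sink_order` below are hypothetical names made up for illustration. The 
sink declares the order it needs, and a planner pass adds a sort only when the 
input does not already provide that order.
   
   ```rust
   // Hypothetical, simplified plan nodes; real DataFusion plans are trait objects.
   #[derive(Debug, Clone, PartialEq)]
   enum Plan {
       /// A source of rows with a known output order (empty = unordered).
       Values { output_order: Vec<String> },
       /// Sorts its input by the given columns.
       Sort { by: Vec<String>, input: Box<Plan> },
       /// Writes its input to a new file; requires the given input order.
       Sink { required_order: Vec<String>, input: Box<Plan> },
   }

   /// The order in which a node emits rows.
   fn output_order(plan: &Plan) -> Vec<String> {
       match plan {
           Plan::Values { output_order } => output_order.clone(),
           Plan::Sort { by, .. } => by.clone(),
           Plan::Sink { .. } => Vec::new(),
       }
   }

   /// Insert a Sort under a Sink when the Sink's requirement is not already met.
   /// The requirement counts as met when it is a prefix of the input's order.
   fn enforce_sink_order(plan: Plan) -> Plan {
       match plan {
           Plan::Sink { required_order, input } => {
               let satisfied = output_order(&input).starts_with(&required_order);
               let input = if satisfied {
                   input
               } else {
                   Box::new(Plan::Sort { by: required_order.clone(), input })
               };
               Plan::Sink { required_order, input }
           }
           other => other,
       }
   }

   fn main() {
       // Models `insert into output values (now())` into a table ordered by `time`.
       let insert = Plan::Sink {
           required_order: vec!["time".to_string()],
           input: Box::new(Plan::Values { output_order: vec![] }),
       };
       println!("{:#?}", enforce_sink_order(insert));
   }
   ```
   
   With an already-sorted input the plan is returned unchanged, so the sort cost 
is only paid when it is needed.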
   
   
   A single file (`LOCATION 'foo.parquet'` for example) likely can't be handled 
efficiently, as doing so would require reading the existing file, merging it 
with the new data, and rewriting the whole file. 
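   
   To make the cost concrete, here is a minimal sketch of that merge using plain 
integers in place of timestamps. `merge_sorted` is a hypothetical helper, not a 
DataFusion function. Even a one-row insert forces every existing row to be read 
and written again.
   
   ```rust
   /// Merge two ascending runs into one ascending run.
   /// `existing` stands in for the rows already in the file,
   /// `incoming` for the newly inserted (already sorted) rows.
   fn merge_sorted(existing: &[i64], incoming: &[i64]) -> Vec<i64> {
       let mut out = Vec::with_capacity(existing.len() + incoming.len());
       let (mut i, mut j) = (0, 0);
       // Repeatedly take the smaller head element of the two runs.
       while i < existing.len() && j < incoming.len() {
           if existing[i] <= incoming[j] {
               out.push(existing[i]);
               i += 1;
           } else {
               out.push(incoming[j]);
               j += 1;
           }
       }
       // One run is exhausted; copy whatever remains of the other.
       out.extend_from_slice(&existing[i..]);
       out.extend_from_slice(&incoming[j..]);
       out
   }

   fn main() {
       let existing = vec![10, 20, 30, 40];
       let incoming = vec![25];
       // The output has existing.len() + incoming.len() rows: a full rewrite.
       println!("{:?}", merge_sorted(&existing, &incoming));
   }
   ```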
   
   
   
   ### Describe alternatives you've considered
   
   Alternatively, we could have a check to see if 1) the table is sorted and 2) 
the input to FileSinkExec is sorted. If 1) is true but 2) is not, we would need 
to update the metadata about the table to indicate to subsequent queries that 
it is no longer guaranteed to be sorted.
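   
   The check described above could be sketched as follows. `InsertOutcome` and 
`outcome_after_insert` are hypothetical names used only for illustration, and 
how the two booleans would be derived from a real plan is left out.
   
   ```rust
   /// What an insert means for the table's declared sort order.
   #[derive(Debug, PartialEq)]
   enum InsertOutcome {
       /// The table never declared an order, so there is nothing to maintain.
       Unordered,
       /// Table and input are both sorted; the declared order still holds.
       OrderPreserved,
       /// The table was sorted but the input is not; the table metadata must
       /// stop advertising the order to subsequent queries.
       OrderInvalidated,
   }

   fn outcome_after_insert(table_is_sorted: bool, input_is_sorted: bool) -> InsertOutcome {
       match (table_is_sorted, input_is_sorted) {
           (false, _) => InsertOutcome::Unordered,
           (true, true) => InsertOutcome::OrderPreserved,
           (true, false) => InsertOutcome::OrderInvalidated,
       }
   }

   fn main() {
       // Condition 1) holds but 2) does not: the order guarantee is lost.
       println!("{:?}", outcome_after_insert(true, false));
   }
   ```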
   
   ### Additional context
   
   _No response_

