[GitHub] [arrow-datafusion] devinjdangelo opened a new issue, #7298: Implement support for FileFormat Options for COPY and Create External Table

via GitHub Wed, 16 Aug 2023 04:39:12 -0700


devinjdangelo opened a new issue, #7298:
URL: https://github.com/apache/arrow-datafusion/issues/7298


   ### Is your feature request related to a problem or challenge?
   
   Currently, the only way to customize how files are written as the result of 
a `COPY` or `INSERT` query is via session-level defaults. For example:
   
   ```sql
   set datafusion.execution.parquet.max_row_group_size=9999;
   
   INSERT INTO my_table values (1,2), (3,4);
   COPY my_table to 'mytable.parquet';
   ``` 
   
   We should implement statement-level and table-level options so that 
individual statements can customize the write behavior as desired. For example:
   
   ```sql
   COPY my_table to 'mytable.parquet' (max_row_group_size 9999)
   ```
   
   Or to set default options for a specific table, rather than globally in a 
session:
   
   ```sql
   CREATE EXTERNAL TABLE my_table 
   STORED AS PARQUET
   LOCATION 'my_table/' 
   OPTIONS (max_row_group_size 9999)
   ```
   
   ### Describe the solution you'd like
   
   We could implement a `WriteOptions` struct with a 
`WriteOptions::from(Vec<(String, String)>)` method so that the struct can be 
created from arbitrary string tuples, like those passed in the statements 
above. `FileSink`s could then accept a `WriteOptions` struct and use it to 
construct a serializer with the desired settings. The DataFrame API can be 
refactored to accept `WriteOptions` directly.
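   
   A minimal sketch of what such a struct might look like, using only the standard library. The `HashMap` storage, the key lowercasing, and the `get_parsed` helper are assumptions made for illustration, not part of any existing DataFusion API:
   
   ```rust
   use std::collections::HashMap;
   use std::str::FromStr;
   
   /// Hypothetical container for statement-level or table-level write options.
   #[derive(Debug, Default, Clone, PartialEq)]
   pub struct WriteOptions {
       options: HashMap<String, String>,
   }
   
   impl From<Vec<(String, String)>> for WriteOptions {
       fn from(pairs: Vec<(String, String)>) -> Self {
           // Assumption: option names are case-insensitive, so keys are lowercased.
           // If a key appears more than once, the last value wins.
           let options = pairs
               .into_iter()
               .map(|(key, value)| (key.to_lowercase(), value))
               .collect();
           Self { options }
       }
   }
   
   impl WriteOptions {
       /// Hypothetical helper: returns `Ok(None)` when the option is absent and
       /// `Err` when it is present but cannot be parsed as `T`.
       pub fn get_parsed<T: FromStr>(&self, key: &str) -> Result<Option<T>, String> {
           match self.options.get(key) {
               None => Ok(None),
               Some(raw) => raw
                   .parse::<T>()
                   .map(Some)
                   .map_err(|_| format!("invalid value '{}' for option '{}'", raw, key)),
           }
       }
   }
   ```
   
   A sink could then call something like `opts.get_parsed::<usize>("max_row_group_size")` and fall back to the session default when the result is `Ok(None)`.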
   
   The existing code which creates a `parquet::WriterProperties` from session 
configs should be refactored to reduce code duplication / share implementation 
details with parsing statement level overrides.
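   
   One way the shared implementation could be structured is sketched below: a single function maps option names to fields, and both the session-config path and the statement-override path go through it. `ParquetSettings`, `set`, and `resolve` are hypothetical names, and the struct is a stand-in rather than the real `parquet` writer properties type:
   
   ```rust
   /// Hypothetical stand-in for the Parquet writer settings discussed here.
   #[derive(Debug, Clone, PartialEq)]
   pub struct ParquetSettings {
       pub max_row_group_size: usize,
       pub compression: String,
   }
   
   impl ParquetSettings {
       /// The single place that maps an option name to a field, so session
       /// defaults and statement overrides cannot drift apart.
       pub fn set(&mut self, key: &str, value: &str) -> Result<(), String> {
           match key {
               "max_row_group_size" => {
                   self.max_row_group_size = value
                       .parse::<usize>()
                       .map_err(|_| format!("invalid max_row_group_size: {}", value))?;
               }
               "compression" => self.compression = value.to_string(),
               other => return Err(format!("unknown parquet option: {}", other)),
           }
           Ok(())
       }
   }
   
   /// Starts from the session-level settings and layers statement-level
   /// overrides on top, leaving the session settings untouched.
   pub fn resolve(
       session: &ParquetSettings,
       overrides: &[(&str, &str)],
   ) -> Result<ParquetSettings, String> {
       let mut settings = session.clone();
       for (key, value) in overrides {
           settings.set(key, value)?;
       }
       Ok(settings)
   }
   ```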
   
   ### Describe alternatives you've considered
   
   Rather than just a generic `WriteOptions` struct, we may want a 
`WriteOptions` trait with specific structs for each file format, e.g. 
`CsvWriteOptions`. Each file format can decide how to handle each option and, 
if desired, emit a warning or error when invalid options are passed (e.g. 
`max_row_group_size` is passed to the CSV writer).
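   
   A rough sketch of this alternative follows. The `from_pairs` method, the `has_header` option, and the default values are assumptions chosen for illustration; only the `WriteOptions` and `CsvWriteOptions` names come from the proposal above:
   
   ```rust
   /// Hypothetical trait: each file format parses and validates its own options.
   pub trait WriteOptions: Sized {
       fn from_pairs(pairs: &[(&str, &str)]) -> Result<Self, String>;
   }
   
   #[derive(Debug, PartialEq)]
   pub struct CsvWriteOptions {
       pub has_header: bool,
   }
   
   #[derive(Debug, PartialEq)]
   pub struct ParquetWriteOptions {
       pub max_row_group_size: usize,
   }
   
   impl WriteOptions for CsvWriteOptions {
       fn from_pairs(pairs: &[(&str, &str)]) -> Result<Self, String> {
           // Placeholder default, not taken from any real configuration.
           let mut opts = CsvWriteOptions { has_header: true };
           for (key, value) in pairs {
               match *key {
                   "has_header" => {
                       opts.has_header = value
                           .parse::<bool>()
                           .map_err(|_| format!("invalid has_header: {}", value))?;
                   }
                   // Parquet-only options such as max_row_group_size end up here.
                   other => return Err(format!("option '{}' is not valid for CSV", other)),
               }
           }
           Ok(opts)
       }
   }
   
   impl WriteOptions for ParquetWriteOptions {
       fn from_pairs(pairs: &[(&str, &str)]) -> Result<Self, String> {
           // Placeholder default, not taken from any real configuration.
           let mut opts = ParquetWriteOptions { max_row_group_size: 1024 * 1024 };
           for (key, value) in pairs {
               match *key {
                   "max_row_group_size" => {
                       opts.max_row_group_size = value
                           .parse::<usize>()
                           .map_err(|_| format!("invalid max_row_group_size: {}", value))?;
                   }
                   other => return Err(format!("option '{}' is not valid for Parquet", other)),
               }
           }
           Ok(opts)
       }
   }
   ```
   
   In this sketch an unsupported option is a hard error. A format that prefers to warn and continue could log the message in its fallback arm instead of returning `Err`.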
   
   ### Additional context
   
   Relevant recent PRs for supporting writes: #7244 #7283 

