devinjdangelo opened a new pull request, #7390:
URL: https://github.com/apache/arrow-datafusion/pull/7390

   ## Which issue does this PR close?
   
   Closes #7322
   Closes #7298
   
   ## Rationale for this change
   
   Currently, the only way to configure how files are written out as a result 
of `INSERT` or `COPY` statements is via session-level configuration. 
Additionally, string parsing of some special arbitrary SQL statement options 
is handled in various places in DataFusion on an ad hoc basis (such as the 
current per_thread_output setting in `COPY`).
   
   This PR aims to consolidate the logic for parsing arbitrary SQL statement 
options and make it reusable, including by downstream systems that may 
choose to support their own special arbitrary options.
   
   ## What changes are included in this PR?
   
   - Move existing parquet setting string parsing logic to datafusion-common
   - Create abstractions to make working with tuples of arbitrary string 
options easier
   - Create a new DataFusionError type for unsupported options
   - Implement string parsing logic for CSV and JSON settings
   - Support writing compressed CSV and JSON files (this one is a 
DataFusion-specific write option)
   - Rename the `per_thread_output` option to `single_file_output`, as I believe 
this is less ambiguous in DataFusion (DuckDB uses `per_thread_output`, but the 
reference to threads does not map well to DataFusion's execution model). It 
also makes more sense in a table context, where a listing table may or may 
not be backed by a single file.
   
   An example of a new query supported by this PR:
   
   ```sql
   COPY source_table TO 
   './table' 
   (format parquet, 
   single_file_output false, 
   compression 'zstd(10)');
   ```
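   Compressed CSV and JSON writes described above follow the same pattern. As 
a sketch only (the table name and output path are placeholders, and the exact 
set of accepted compression values depends on what the CSV/JSON writer option 
parsing supports), a compressed CSV write might look like:
   
   ```sql
   COPY source_table TO 
   './out.csv.gz' 
   (format csv, 
   single_file_output true, 
   compression 'gzip');
   ```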
   
   ### Notable work that is important for follow-ons
   - Support these options in the DataFrame::write_* functions without 
requiring passing of strings (current thinking is to allow passing a fully 
formed writer builder object directly).
   - Improve test coverage of various combinations of write options.
   
   ## Are these changes tested?
   
   Yes, via existing tests and some new ones. However, there is now a very 
large number of possible option combinations. We may want to implement a 
dynamic test case generation framework to ensure that no specific combination 
of options fails.
   
   ## Are there any user-facing changes?
   
   Yes, writes now support most options that the various arrow writers support 
via SQL statements. The primary exception right now is column-specific parquet 
options (e.g. setting an encoding for only one column). Additional work will be 
needed to support that.

