devinjdangelo opened a new pull request, #7390: URL: https://github.com/apache/arrow-datafusion/pull/7390
## Which issue does this PR close?

Closes #7322
Closes #7298

## Rationale for this change

Currently, the only way to configure how files are written out as a result of `INSERT` or `COPY` statements is via session-level configuration. Additionally, string parsing of some special arbitrary SQL statement options is handled in various places in DataFusion on an ad hoc basis (such as the current `per_thread_output` setting in `COPY`). This PR aims to consolidate the logic for parsing arbitrary SQL statement options and to support reuse of that code, including in downstream systems that may choose to support their own special arbitrary options.

## What changes are included in this PR?

- Move existing parquet setting string parsing logic to `datafusion-common`
- Create abstractions to make working with tuples of arbitrary string options easier
- Create a new `DataFusionError` type for unsupported options
- Implement string parsing logic for CSV and JSON settings
- Support writing compressed CSV and JSON files (this one is a DataFusion-specific write option)
- Rename the `per_thread_output` option to `single_file_output`, as I believe this is less ambiguous in DataFusion (DuckDB uses the `per_thread_output` option, but the reference to threads doesn't really make sense in DataFusion). This also makes more sense in a table context, where a listing table may or may not be backed by a single file.

An example of a new query supported by this PR:

```sql
COPY source_table TO './table' (format parquet, single_file_output false, compression 'zstd(10)');
```

### Notable work that is important for follow-ons

- Support these options in the `DataFrame::write_*` functions without requiring passing of strings (current thinking is to allow passing a fully formed writer builder object directly).
- Improve test coverage of various combinations of write options.

## Are these changes tested?

Yes, via existing tests and some new ones. However, there is now a very large number of possible option combinations.
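The number of combinations grows multiplicatively with each new option. As a rough illustration of how a combinatorial sweep could be generated, the sketch below enumerates the cross product of a few hypothetical option values (the option names and values are illustrative, not an exhaustive list of DataFusion's settings):

```rust
// Hypothetical sketch: enumerate the cross product of write-option
// values to drive combinatorial tests. Option names and values here
// are illustrative only.
fn option_combinations() -> Vec<(String, String, String)> {
    let formats = ["parquet", "csv", "json"];
    let single_file = ["true", "false"];
    let compression = ["uncompressed", "gzip", "zstd(10)"];

    let mut cases = Vec::new();
    for f in formats {
        for s in single_file {
            for c in compression {
                cases.push((f.to_string(), s.to_string(), c.to_string()));
            }
        }
    }
    cases
}

fn main() {
    let cases = option_combinations();
    // 3 formats x 2 single_file values x 3 compressions = 18 cases.
    assert_eq!(cases.len(), 18);
    for (format, single, comp) in &cases {
        println!(
            "COPY t TO './out' (format {format}, single_file_output {single}, compression '{comp}');"
        );
    }
}
```

Each generated tuple could then be spliced into a `COPY` statement template and executed against a scratch table.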
We may want to implement a dynamic test case generation framework to ensure that no specific combination of options fails.

## Are there any user-facing changes?

Yes, writes now support most options that the various arrow writers support via SQL statements. The primary exception right now is column-specific parquet options (i.e. setting an encoding for only one column); additional work will be needed to support that.
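For illustration, the kind of option-string parsing this PR consolidates (e.g. turning a value like `'zstd(10)'` into a codec name and compression level) could be sketched as below. This is a standalone, hypothetical sketch under assumed semantics, not DataFusion's actual implementation:

```rust
// Hypothetical sketch of compression-spec parsing:
//   "zstd(10)" -> ("zstd", Some(10))
//   "SNAPPY"   -> ("snappy", None)
// Errors are returned as plain strings for simplicity.
fn parse_compression(spec: &str) -> Result<(String, Option<u32>), String> {
    let spec = spec.trim();
    if let Some(open) = spec.find('(') {
        let codec = spec[..open].to_lowercase();
        let rest = &spec[open + 1..];
        let close = rest
            .find(')')
            .ok_or_else(|| format!("missing ')' in '{spec}'"))?;
        let level = rest[..close]
            .trim()
            .parse::<u32>()
            .map_err(|e| format!("invalid level in '{spec}': {e}"))?;
        Ok((codec, Some(level)))
    } else {
        // No parenthesized level: codec name only.
        Ok((spec.to_lowercase(), None))
    }
}

fn main() {
    assert_eq!(
        parse_compression("zstd(10)").unwrap(),
        ("zstd".to_string(), Some(10))
    );
    assert_eq!(
        parse_compression("SNAPPY").unwrap(),
        ("snappy".to_string(), None)
    );
    assert!(parse_compression("zstd(ten)").is_err());
}
```

A real implementation would additionally validate the codec name against the formats the writer supports and map the result onto the writer's own compression type.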
