devinjdangelo opened a new pull request, #7141: URL: https://github.com/apache/arrow-datafusion/pull/7141
# Which issue does this PR close?

Closes #5076

# Rationale for this change

The goal of this PR is to let the DataFrame write methods share a common implementation with SQL `Insert Into` statements, so that logic for writing via `ObjectStore`, along with parallelization and other optimizations (such as those discussed in #7079), can live in one place.

# What changes are included in this PR?

The following changes are completed/planned:

- [x] Implement a `DataFrame.write_table` method which creates an insert_into `LogicalPlan` and executes it eagerly
- [x] Extend `InsertExec` / `DataSink` / `ListingTable.insert_into` to support writing multiple files from multiple partitions
- [x] Extend `CsvSink` to support writing multiple partitions to multiple files
- [ ] Create `JsonSink` supporting writing multiple partitions to multiple files
- [ ] Create `ParquetSink` supporting writing multiple partitions to multiple files
- [ ] Update the existing `write_json`, `write_csv`, and `write_parquet` methods to create temporary tables and call `DataFrame.write_table`

# Are these changes tested?

I have not yet implemented any new tests to cover these changes. Any suggestions on new tests are welcome.

# Are there any user-facing changes?

The goal is for the existing `DataFrame` write methods to behave nearly identically to before. The `write_table` method is a new public method.
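To make the "multiple partitions to multiple files" items above concrete, here is a minimal, standard-library-only sketch of the idea. It is not DataFusion's `CsvSink`: the function name `write_partitions_csv`, the `Row` type, and the `part-{i}.csv` naming scheme are all hypothetical, it writes to the local filesystem rather than through `ObjectStore`, and it processes partitions sequentially where a real sink would presumably write them concurrently.

```rust
use std::fs::{self, File};
use std::io::{self, BufWriter, Write};
use std::path::{Path, PathBuf};

/// One row of already-stringified column values.
/// Hypothetical stand-in for a record batch; not a DataFusion type.
type Row = Vec<String>;

/// Hypothetical sink: writes each input partition to its own CSV file
/// under `dir` and returns the paths written plus the total row count.
/// Values are joined with commas as-is; no CSV quoting or escaping is done.
fn write_partitions_csv(
    header: &[&str],
    partitions: &[Vec<Row>],
    dir: &Path,
) -> io::Result<(Vec<PathBuf>, u64)> {
    fs::create_dir_all(dir)?;
    let mut paths = Vec::with_capacity(partitions.len());
    let mut row_count = 0u64;

    // One output file per input partition (assumed naming scheme).
    for (i, rows) in partitions.iter().enumerate() {
        let path = dir.join(format!("part-{i}.csv"));
        let mut out = BufWriter::new(File::create(&path)?);

        // Every file carries its own header so it can be read independently.
        writeln!(out, "{}", header.join(","))?;
        for row in rows {
            writeln!(out, "{}", row.join(","))?;
            row_count += 1;
        }
        out.flush()?;
        paths.push(path);
    }
    Ok((paths, row_count))
}
```

Calling `write_partitions_csv(&["id", "name"], &partitions, dir)` with two partitions produces two files in `dir` and returns the number of data rows written, which mirrors the row count an insert typically reports.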
