wiedld opened a new pull request, #9548: URL: https://github.com/apache/arrow-datafusion/pull/9548
**POC for Discussion only: DO NOT MERGE.**

## Which issue does this PR close?

For discussion of https://github.com/apache/arrow-datafusion/issues/9493.

## Rationale for this change

We are proposing a generalized public API that provides access to parallelized parquet writes outside of the COPY TO execution context. The code shared here is **NOT** the change we are requesting. Instead, it shows what limitations currently exist when trying to use the ParquetSink for parquet writing in place of the ArrowWriter.

## What changes are included in this PR?

What ArrowWriter already provides, and what we had to change in order to use ParquetSink:

* expose the FileMetaData associated with the created parquet:
  * **ArrowWriter already provides:**
    * the FileMetaData is provided in the [ArrowWriter::close() return signature](https://github.com/apache/arrow-rs/blob/c6ba0f764a9142b74c9070db269de04d2701d112/parquet/src/arrow/arrow_writer/mod.rs#L254).
  * **ParquetSink had to be changed:**
    * as ParquetSink is intended for use inside a query execution context, and writes to 1+ file sinks, it does not currently return any FileMetaData associated with any of the sinks.
    * We had to change this in order for the POC to work.
* provide the appropriate schema in the kv store:
  * **ArrowWriter already provides:**
    * [ArrowWriter::try_new() both serializes the schema and maps it to the appropriate key](https://github.com/apache/arrow-rs/blob/c6ba0f764a9142b74c9070db269de04d2701d112/parquet/src/arrow/arrow_writer/mod.rs#L141) within the kv_store of the WriterProperties.
  * **ParquetSink had to be changed:**
    * the ParquetSink does not include this functionality.
    * As such, we had to perform this mutation of WriterProperties in our own code (by extracting `add_encoded_arrow_schema_to_metadata()` and the associated upstream functionality).

## Are these changes tested?

This code will not be merged.

## Are there any user-facing changes?

This code will not be merged.
