wiedld opened a new pull request, #9548:
URL: https://github.com/apache/arrow-datafusion/pull/9548

   **POC for Discussion only: DO NOT MERGE.**
   
   
   ## Which issue does this PR close?
   
   For discussion of https://github.com/apache/arrow-datafusion/issues/9493.
   
   ## Rationale for this change
   
   We are proposing a generalized public API that provides access to 
parallelized parquet writes outside of the COPY TO execution context. The code 
shared here is **NOT** the change we are requesting. Instead, it shows the 
limitations we currently hit when using the ParquetSink for parquet 
writing in place of the ArrowWriter.
   
   ## What changes are included in this PR?
   
   What ArrowWriter already provides, and what we had to change in order to use 
ParquetSink instead:
   * expose the FileMetaData associated with the created parquet:
      * **ArrowWriter already provides:**
          * in the [ArrowWriter::close() return 
signature](https://github.com/apache/arrow-rs/blob/c6ba0f764a9142b74c9070db269de04d2701d112/parquet/src/arrow/arrow_writer/mod.rs#L254),
 the FileMetaData is provided.
      * **ParquetSink had to be changed:**
          * ParquetSink is intended for use inside a query execution 
context and writes to one or more file sinks, so it does not currently return the 
FileMetaData associated with any of those sinks.
          * We had to change this, in order for the POC to work.
   * provide the appropriate schema in the kv store:
      * **ArrowWriter already provides:**
          * [ArrowWriter::try_new() both serializes the schema and maps it to 
the appropriate 
key](https://github.com/apache/arrow-rs/blob/c6ba0f764a9142b74c9070db269de04d2701d112/parquet/src/arrow/arrow_writer/mod.rs#L141)
 within the kv_store of the WriterProperties.
      * **ParquetSink had to be changed:**
          * The ParquetSink does not include this functionality.
          * As such, we had to apply this mutation of WriterProperties in our 
own code (by extracting `add_encoded_arrow_schema_to_metadata()` and the 
associated upstream functionality).
   
   
   
   ## Are these changes tested?
   
   This code will not be merged.
   
   ## Are there any user-facing changes?
   
   This code will not be merged.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
