alamb commented on issue #9493:
URL: 
https://github.com/apache/arrow-datafusion/issues/9493#issuecomment-1984464145

   Thank you @wiedld  -- would it be possible to create a PR with the content 
from 
https://github.com/apache/arrow-datafusion/compare/main...wiedld:arrow-datafusion:test/parquet-data-sink
 to help the API discussion? 
   
   At a high level I would like to suggest a new API (factored out from the 
current `ParquetSink`). 
   
   I believe @tustvold was potentially thinking about proposing some/all of 
this API upstream in the `parquet` crate (maybe by adding some mode / option to 
[`AsyncArrowWriter`](https://docs.rs/parquet/latest/parquet/arrow/async_writer/struct.AsyncArrowWriter.html),
 but he may have other ideas).
   
   Here is the kind of API I was imagining in DataFusion
   
   ```rust
   // get a RecordBatchStream somehow
   let stream: SendableRecordBatchStream = plan.execute(0);
   
   // location in object store to write to
   let object_store_path = ...;
   
   // properties to pass to the underlying parquet writer
   let parquet_writer_properties = ...;
   
   
   // Create a new parallel writer for writing a single file
   // Note this API doesn't handle writing partitioned/multiple files; that
   // would still be done by the ParquetSink
   let writer = ParallelParquetWriter::builder()
     .with_target(object_store_path)
     .with_writer_properties(parquet_writer_properties)
     // how many row groups should the parquet writer attempt to write in
     // parallel
     // note the writer may buffer the (uncompressed) RecordBatches for up to
     // `N - 1` row groups. This setting defaults to `1`.
     // TODO? Should we have this?
     .with_concurrent_row_groups(2)
     .build()?;
   
   // Invoke the writer: encodes the columns in parallel and uploads each
   // row group using a multi-part put.
   writer.write_all(stream)
     .await?;
   
   ```
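
To make the shape of the builder a bit more concrete, here is a minimal, std-only sketch of how the configuration side could look. Everything in it is hypothetical: `ParallelParquetWriter`, its builder, `BuildError`, and `WriterPropertiesStub` (a stand-in for the real writer properties type) do not exist in DataFusion or the `parquet` crate, and the target is modelled as a plain `String` rather than an object store path. Rejecting a concurrency of `0` is also an assumption on my part. It only illustrates the default of `1` and the `N - 1` buffering bound; it does no encoding or uploading.

```rust
// Hypothetical sketch only: none of these types exist in DataFusion or `parquet`.
use std::fmt;

/// Stand-in for the real parquet writer properties.
#[derive(Debug, Clone, Default, PartialEq)]
pub struct WriterPropertiesStub {
    pub max_row_group_size: usize,
}

/// Errors the (hypothetical) builder can report.
#[derive(Debug, PartialEq)]
pub enum BuildError {
    MissingTarget,
    ZeroConcurrency,
}

impl fmt::Display for BuildError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            BuildError::MissingTarget => write!(f, "no target path was set"),
            BuildError::ZeroConcurrency => write!(f, "concurrent row groups must be >= 1"),
        }
    }
}

impl std::error::Error for BuildError {}

/// Configuration for writing a single file (no partitioning).
#[derive(Debug)]
pub struct ParallelParquetWriter {
    target: String,
    props: WriterPropertiesStub,
    concurrent_row_groups: usize,
}

impl ParallelParquetWriter {
    pub fn builder() -> ParallelParquetWriterBuilder {
        ParallelParquetWriterBuilder::default()
    }

    pub fn target(&self) -> &str {
        &self.target
    }

    pub fn writer_properties(&self) -> &WriterPropertiesStub {
        &self.props
    }

    pub fn concurrent_row_groups(&self) -> usize {
        self.concurrent_row_groups
    }

    /// With N concurrent row groups, up to N - 1 row groups' worth of
    /// uncompressed batches may be buffered.
    pub fn max_buffered_row_groups(&self) -> usize {
        self.concurrent_row_groups - 1
    }
}

#[derive(Debug, Default)]
pub struct ParallelParquetWriterBuilder {
    target: Option<String>,
    props: WriterPropertiesStub,
    concurrent_row_groups: Option<usize>,
}

impl ParallelParquetWriterBuilder {
    pub fn with_target(mut self, target: impl Into<String>) -> Self {
        self.target = Some(target.into());
        self
    }

    pub fn with_writer_properties(mut self, props: WriterPropertiesStub) -> Self {
        self.props = props;
        self
    }

    pub fn with_concurrent_row_groups(mut self, n: usize) -> Self {
        self.concurrent_row_groups = Some(n);
        self
    }

    pub fn build(self) -> Result<ParallelParquetWriter, BuildError> {
        let target = self.target.ok_or(BuildError::MissingTarget)?;
        // Defaults to 1, i.e. no row-group-level parallelism.
        let n = self.concurrent_row_groups.unwrap_or(1);
        if n == 0 {
            return Err(BuildError::ZeroConcurrency);
        }
        Ok(ParallelParquetWriter {
            target,
            props: self.props,
            concurrent_row_groups: n,
        })
    }
}
```

With this shape, leaving out `with_concurrent_row_groups` yields a writer that buffers nothing beyond the row group currently being written, which keeps the memory cost opt-in.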
   
   Also, some of the changes are probably pretty straightforward to pull into 
their own PRs, such as the following. @wiedld perhaps you could make a PR to 
add those APIs (the ones that are unlikely to be controversial)?
   
   ```rust
       /// Returns a reference to the underlying `Url`
       pub fn as_url(&self) -> &Url {
           &self.url
       }
   ```

