alamb commented on issue #9493: URL: https://github.com/apache/arrow-datafusion/issues/9493#issuecomment-1984464145
Thank you @wiedld -- would it be possible to create a PR with the content from https://github.com/apache/arrow-datafusion/compare/main...wiedld:arrow-datafusion:test/parquet-data-sink to help the API discussion?

At a high level I would like to suggest a new API (factored out from the current `ParquetSink`). I believe @tustvold was potentially thinking about proposing some/all of this API upstream in the `parquet` crate (maybe by adding some mode / option to [`AsyncWriter`](https://docs.rs/parquet/latest/parquet/arrow/async_writer/struct.AsyncArrowWriter.html), but he may have other ideas).

Here is the kind of API I was imagining in DataFusion:

```rust
// get a RecordBatchStream somehow
let stream: SendableRecordBatchStream = plan.execute(0);
// location in object store to write to
let object_store_path = ...;
// properties to pass to the underlying parquet writer
let parquet_writer_properties = ...;

// Create a new parallel writer for writing a single file
// Note this API doesn't handle writing partitioned/multiple files, that would
// still be done by the ParquetSink
let writer = ParallelParquetWriter::builder()
    .with_target(object_store_path)
    .with_writer_properties(parquet_writer_properties)
    // how many row groups should the parquet writer attempt to write in parallel
    // note the writer may buffer the (uncompressed) RecordBatches for up to `N - 1`
    // row groups. This setting defaults to `1`.
    // TODO? Should we have this?
    .with_concurrent_row_groups(2)
    .build()?;

// Invoke the writer: encodes the columns in parallel and uploads each row group
// using a multi-part put.
writer.write_all(stream)
    .await?;
```

Also, some of the changes are probably pretty straightforward to pull into their own PRs, such as the following. @wiedld perhaps you could make a PR to add those (likely non-controversial) APIs?

```rust
/// Returns a reference to the inner `Url`
pub fn as_url(&self) -> &Url {
    &self.url
}
```

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
