alamb commented on code in PR #8162:
URL: https://github.com/apache/arrow-rs/pull/8162#discussion_r2358750140
##########
parquet/src/arrow/async_writer/mod.rs:
##########
@@ -288,34 +288,17 @@ impl<W: AsyncFileWriter> AsyncArrowWriter<W> {
Ok(())
}
-
- /// Create a new row group writer and return its column writers.
- pub async fn get_column_writers(&mut self) ->
Result<Vec<ArrowColumnWriter>> {
Review Comment:
I double checked and this code is not yet released, so this is not a public
API change. It was added in
- https://github.com/apache/arrow-rs/pull/8262
##########
parquet/src/arrow/arrow_writer/mod.rs:
##########
@@ -906,6 +918,12 @@ impl ArrowRowGroupWriterFactory {
let writers = get_column_writers(&self.schema, &self.props,
&self.arrow_schema)?;
Ok(ArrowRowGroupWriter::new(writers, &self.arrow_schema))
}
+
+ /// Create column writers for a new row group.
+ pub fn create_column_writers(&self, row_group_index: usize) ->
Result<Vec<ArrowColumnWriter>> {
Review Comment:
I see now that this is the key API -- create column writers with the
relevant encryption properties, if relevant
##########
parquet/src/arrow/arrow_writer/mod.rs:
##########
@@ -409,6 +409,7 @@ impl<W: Write + Send> ArrowWriter<W> {
}
/// Create a new row group writer and return its column writers.
+ #[deprecated(since = "56.2.0", note = "Use into_serialized_writer
instead")]
Review Comment:
Note to myself, these APIs were added in
- https://github.com/apache/arrow-rs/pull/8029
So they are relatively new
##########
parquet/src/arrow/async_writer/mod.rs:
##########
@@ -361,7 +344,13 @@ mod tests {
// Use classic API
writer.write(&to_write_record).await.unwrap();
- let mut writers = writer.get_column_writers().await.unwrap();
+ // Use low-level API to write an Arrow group
+ let arrow_writer = writer.sync_writer;
Review Comment:
I think it would be good to use public APIs in the tests, rather than
accessing the inner fields directly (which is not possible from other crates)
##########
parquet/src/arrow/arrow_writer/mod.rs:
##########
@@ -426,6 +428,15 @@ impl<W: Write + Send> ArrowWriter<W> {
row_group_writer.close()?;
Ok(())
}
+
+ /// Converts this writer into a lower-level [`SerializedFileWriter`] and
[`ArrowRowGroupWriterFactory`].
Review Comment:
I spent quite a while trying to figure out why we can't just use
`get_column_writers` as in the
[example](https://docs.rs/parquet/latest/parquet/arrow/arrow_writer/struct.ArrowColumnWriter.html#example-encoding-two-arrow-arrays-in-parallel)
to use ArrowRowGroupWriterFactory and I (finally) realized the reason is the
encryption configuration isn't passed to `get_column_writers`.
`ArrowRowGroupWriterFactory` does have the encryption details and thus can make
the correct ArrowColumnWriters.
I think it is somewhat of a strange API to create an `ArrowWriter` only to
immediately destructure it into a SerializedWriter / the underlying writer. It
is also unfortunate we now have two different APIs for writing row groups in
parallel, depending on encryption.
I have an idea to make the APIs better as a follow on.
##########
parquet/src/arrow/arrow_writer/mod.rs:
##########
@@ -426,6 +428,15 @@ impl<W: Write + Send> ArrowWriter<W> {
row_group_writer.close()?;
Ok(())
}
+
+ /// Converts this writer into a lower-level [`SerializedFileWriter`] and
[`ArrowRowGroupWriterFactory`].
+ /// This can be useful to provide more control over how files are written.
+ pub fn into_serialized_writer(
Review Comment:
I spent quite some time trying to figure out why this API is needed --
specifically "why do we need an `ArrowWriter` at all, why not use
SerializedFileWriter and `get_column_writers` directly, as shown in [this
example](https://docs.rs/parquet/latest/parquet/arrow/arrow_writer/struct.ArrowColumnWriter.html#example-encoding-two-arrow-arrays-in-parallel)
After study I concluded the reason we need to expose
`ArrowRowGroupWriterFactory` is that
`ArrowRowGroupWriterFactory::create_column_writers` also has the appropriate
encryption properties.
It is unfortunate that we'll now have two different sets of APIs for
creating column writers -- via `get_column_writers` AND
`ArrowRowGroupWriterFactory`
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]