alamb commented on code in PR #8162:
URL: https://github.com/apache/arrow-rs/pull/8162#discussion_r2358750140


##########
parquet/src/arrow/async_writer/mod.rs:
##########
@@ -288,34 +288,17 @@ impl<W: AsyncFileWriter> AsyncArrowWriter<W> {
 
         Ok(())
     }
-
-    /// Create a new row group writer and return its column writers.
-    pub async fn get_column_writers(&mut self) -> 
Result<Vec<ArrowColumnWriter>> {

Review Comment:
   I double checked and this code is not yet released, so this is not a public 
API change. It was added in 
   
   - https://github.com/apache/arrow-rs/pull/8262 



##########
parquet/src/arrow/arrow_writer/mod.rs:
##########
@@ -906,6 +918,12 @@ impl ArrowRowGroupWriterFactory {
         let writers = get_column_writers(&self.schema, &self.props, 
&self.arrow_schema)?;
         Ok(ArrowRowGroupWriter::new(writers, &self.arrow_schema))
     }
+
+    /// Create column writers for a new row group.
+    pub fn create_column_writers(&self, row_group_index: usize) -> 
Result<Vec<ArrowColumnWriter>> {

Review Comment:
   I see now that this is the key API -- create column writers with the 
relevant encryption properties, if relevant



##########
parquet/src/arrow/arrow_writer/mod.rs:
##########
@@ -409,6 +409,7 @@ impl<W: Write + Send> ArrowWriter<W> {
     }
 
     /// Create a new row group writer and return its column writers.
+    #[deprecated(since = "56.2.0", note = "Use into_serialized_writer 
instead")]

Review Comment:
   Note to myself, these APIs were added in
   - https://github.com/apache/arrow-rs/pull/8029
   So they are relatively new



##########
parquet/src/arrow/async_writer/mod.rs:
##########
@@ -361,7 +344,13 @@ mod tests {
         // Use classic API
         writer.write(&to_write_record).await.unwrap();
 
-        let mut writers = writer.get_column_writers().await.unwrap();
+        // Use low-level API to write an Arrow group
+        let arrow_writer = writer.sync_writer;

Review Comment:
   I think it would be good to use public APIs in the tests, rather than 
accessing the inner fields directly (which is not possible from other crates)



##########
parquet/src/arrow/arrow_writer/mod.rs:
##########
@@ -426,6 +428,15 @@ impl<W: Write + Send> ArrowWriter<W> {
         row_group_writer.close()?;
         Ok(())
     }
+
+    /// Converts this writer into a lower-level [`SerializedFileWriter`] and 
[`ArrowRowGroupWriterFactory`].

Review Comment:
   I spent quite a while trying to figure out why we can't just use 
`get_column_writers` as in the 
[example](https://docs.rs/parquet/latest/parquet/arrow/arrow_writer/struct.ArrowColumnWriter.html#example-encoding-two-arrow-arrays-in-parallel)
 to use ArrowRowGroupWriterFactory  and I (finally) realized the reason is the 
encryption configuration isn't passed to `get_column_writers`. 
`ArrowRowGroupWriterFactory` does have the encryption details and thus can make 
the correct ArrowColumnWriters.
   
   I think it is somewhat of a strange API to create an `ArrowWriter` only to 
immediately destructure it into a SerializedWriter / the underlying writer. It 
is also unfortunate we now have two different APIs for writing row groups in 
parallel, depending on encryption.
   
   I have an idea to make the APIs better as a follow on.



##########
parquet/src/arrow/arrow_writer/mod.rs:
##########
@@ -426,6 +428,15 @@ impl<W: Write + Send> ArrowWriter<W> {
         row_group_writer.close()?;
         Ok(())
     }
+
+    /// Converts this writer into a lower-level [`SerializedFileWriter`] and 
[`ArrowRowGroupWriterFactory`].
+    /// This can be useful to provide more control over how files are written.
+    pub fn into_serialized_writer(

Review Comment:
   I spent quite some time trying to figure out why this API is needed -- 
specifically "why do we need an `ArrowWriter` at all, why not use 
SerializedFileWriter and `get_column_writers` directly, as shown in [this 
example](https://docs.rs/parquet/latest/parquet/arrow/arrow_writer/struct.ArrowColumnWriter.html#example-encoding-two-arrow-arrays-in-parallel)
   
   After study I concluded the reason we need to expose 
`ArrowRowGroupWriterFactory` is that 
`ArrowRowGroupWriterFactory::create_column_writers` also has the appropriate 
encryption properties.
   
   It is unfortunate that we'll now have two different sets of APIs for 
creating column writers -- via `get_column_writers` AND 
`ArrowRowGroupWriterFactory`
   
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to