rok opened a new issue, #8115:
URL: https://github.com/apache/arrow-rs/issues/8115

   #8029 introduced `ArrowWriter.get_column_writers` to expose the 
`Vec<ArrowColumnWriter>` of the "in progress" `ArrowRowGroupWriter`. This was 
meant to let downstream libraries write columns and row groups concurrently. 
However, only one `ArrowRowGroupWriter` exists at a time, and all of its 
`ArrowColumnWriter`s must complete before the next `RowGroup` can be 
serialized. Downstream locking works around this, but it is not ideal. See 
https://github.com/apache/datafusion/pull/16738#issuecomment-3177700851.
   
   We could:
   1. Have downstream users lock and serialize only one `RowGroup` at a time.
   1. Have `ArrowWriter` keep a `Vec<ArrowRowGroupWriter>` for all `RowGroup`s 
currently being serialized.
   1. Expose the `ArrowRowGroupWriterFactory` of the active `ArrowWriter`.
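
   To make the first option concrete, here is a minimal sketch of the locking pattern using only the Rust standard library. It does not use the parquet crate: `SketchWriter`, `write_row_group`, and `run_sketch` are hypothetical stand-ins, and "encoding" a column is reduced to counting its bytes. The sketch only assumes what is described above, namely that columns within a row group can be processed in parallel while row groups themselves are serialized one at a time.

```rust
use std::sync::Mutex;
use std::thread;

/// Hypothetical stand-in for a writer that allows only one row group
/// to be in progress at a time; the mutex models that restriction.
struct SketchWriter {
    /// Byte count per column for every row group flushed so far.
    flushed: Mutex<Vec<Vec<usize>>>,
}

impl SketchWriter {
    fn new() -> Self {
        SketchWriter { flushed: Mutex::new(Vec::new()) }
    }

    /// Processes the columns of one row group on separate threads, then
    /// records the row group. The lock is held for the whole call, so a
    /// second row group cannot start until this one has finished.
    fn write_row_group(&self, columns: &[Vec<u8>]) {
        let mut flushed = self.flushed.lock().unwrap();
        let sizes: Vec<usize> = thread::scope(|s| {
            let handles: Vec<_> = columns
                .iter()
                // "Encoding" is just a byte count here; a real column writer does more.
                .map(|col| s.spawn(move || col.len()))
                .collect();
            handles.into_iter().map(|h| h.join().unwrap()).collect()
        });
        flushed.push(sizes);
    }
}

/// Submits three row groups from three threads; the mutex serializes them.
fn run_sketch() -> Vec<Vec<usize>> {
    let writer = SketchWriter::new();
    let row_groups = vec![
        vec![vec![0u8; 1], vec![0u8; 2]],
        vec![vec![0u8; 3], vec![0u8; 4]],
        vec![vec![0u8; 5], vec![0u8; 6]],
    ];
    thread::scope(|s| {
        for rg in &row_groups {
            let writer = &writer;
            s.spawn(move || writer.write_row_group(rg));
        }
    });
    writer.flushed.into_inner().unwrap()
}
```

   The cost of this pattern is visible in the sketch: threads holding later row groups sit idle on the lock until the current one is flushed, and the order in which row groups are recorded depends on which thread acquires the lock first. That is the motivation for the second and third options.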
   
   Additionally, we should introduce a test like 
[write_parquet_with_small_rg_size](https://github.com/apache/datafusion/blob/7d5214512740b4dfb742b6b3d91ed9affcc2c9d0/datafusion/core/src/dataframe/parquet.rs#L201)
 with encryption enabled to sufficiently test this code path.


