rok opened a new issue, #8115: URL: https://github.com/apache/arrow-rs/issues/8115
#8029 introduced `ArrowWriter.get_column_writers` to expose the `Vec<ArrowColumnWriter>` of the "in progress" `ArrowRowGroupWriter`. This was to enable downstream libraries to write columns and row groups concurrently. However, only one `ArrowRowGroupWriter` exists at a time, and all of its `ArrowColumnWriter`s need to complete before a new `RowGroup` can proceed to be serialized. This can be solved with locking, but that is not ideal. See https://github.com/apache/datafusion/pull/16738#issuecomment-3177700851.

We could:

1. Have downstream users use locking and serialize only one `RowGroup` at a time.
2. Have `ArrowWriter` keep a `Vec<ArrowRowGroupWriter>` for all `RowGroups` currently being serialized.
3. Expose the `ArrowRowGroupWriterFactory` of the active `ArrowWriter`.

Additionally, we should introduce a test like [write_parquet_with_small_rg_size](https://github.com/apache/datafusion/blob/7d5214512740b4dfb742b6b3d91ed9affcc2c9d0/datafusion/core/src/dataframe/parquet.rs#L201), with encryption enabled, to sufficiently test this codepath.
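
To make the second option concrete, here is a rough, std-only sketch of the idea: several row groups are "in progress" at once, every column of every row group is encoded on its own thread, and the results are put back into file order before being written. It does not use the `parquet` crate; `RowGroupInProgress`, `EncodedColumn`, `encode_column` and `encode_row_groups` are hypothetical names made up for illustration, and the "encoding" is a plain copy.

```rust
use std::thread;

/// Hypothetical stand-in for an encoded column chunk (not a parquet type).
#[derive(Debug, PartialEq)]
struct EncodedColumn {
    row_group: usize,
    column: usize,
    bytes: Vec<u8>,
}

/// Hypothetical stand-in for one in-progress row group: one buffer per column.
struct RowGroupInProgress {
    index: usize,
    columns: Vec<Vec<u8>>,
}

/// Placeholder for real column encoding; here the data is passed through unchanged.
fn encode_column(row_group: usize, column: usize, data: Vec<u8>) -> EncodedColumn {
    EncodedColumn { row_group, column, bytes: data }
}

/// Encode every column of every in-progress row group on its own thread,
/// then return the chunks sorted into (row_group, column) order.
fn encode_row_groups(row_groups: Vec<RowGroupInProgress>) -> Vec<EncodedColumn> {
    let mut handles = Vec::new();
    for rg in row_groups {
        let index = rg.index;
        for (column, data) in rg.columns.into_iter().enumerate() {
            // No lock is needed: each thread owns its column buffer.
            handles.push(thread::spawn(move || encode_column(index, column, data)));
        }
    }

    let mut encoded: Vec<EncodedColumn> = handles
        .into_iter()
        .map(|h| h.join().expect("encoder thread panicked"))
        .collect();

    // Row groups may have been submitted out of order; restore file order
    // before anything would be serialized.
    encoded.sort_by_key(|c| (c.row_group, c.column));
    encoded
}
```

In this toy model, no row group has to wait for another one to finish encoding; only the final, ordered serialization step is sequential. Whether `ArrowWriter` can hold several `ArrowRowGroupWriter`s in a similar way is exactly the open question of this issue.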