alamb commented on PR #10020: URL: https://github.com/apache/arrow-rs/pull/10020#issuecomment-4605186038
> We currently buffer entire row groups in memory. From [our own docs](https://docs.rs/parquet/latest/parquet/arrow/arrow_writer/struct.ArrowWriter.html#memory-usage-and-limiting):
>
> > The nature of Parquet requires buffering of an entire row group before it can be flushed to the underlying writer.
>
> For our production workload where we have ~400 columns with large data skews (some much larger than others) this causes >=12GBs of memory consumed _just to write Parquet_.

FWIW this is one of the things that @westonpace highlighted in his recent talk [Weston Pace: DataFusion without Row Groups, Seattle/Bellevue April 2026](https://www.youtube.com/watch?v=fDrmfDuPK3s&t=1s)
