Re: [PR] Pluggable page spilling for the Parquet ArrowWriter (PageStore) [arrow-rs]

via GitHub Tue, 02 Jun 2026 10:31:04 -0700


alamb commented on PR #10020:
URL: https://github.com/apache/arrow-rs/pull/10020#issuecomment-4605186038


   > We currently buffer entire row groups in memory. From [our own docs](https://docs.rs/parquet/latest/parquet/arrow/arrow_writer/struct.ArrowWriter.html#memory-usage-and-limiting):
   > 
   > > The nature of Parquet requires buffering of an entire row group before it can be flushed to the underlying writer.
   > 
   > For our production workload where we have ~400 columns with large data skews (some much larger than others) this causes >=12GBs of memory consumed _just to write Parquet_.
   
   FWIW this is one of the things that @westonpace highlighted in his recent 
talk [Weston Pace: DataFusion without Row Groups, Seattle/Bellevue April 2026](https://www.youtube.com/watch?v=fDrmfDuPK3s&t=1s)


