alamb opened a new issue, #10071:
URL: https://github.com/apache/arrow-rs/issues/10071

   ### Is your feature request related to a problem?
   `ArrowWriter` must buffer every column's *compressed* pages for an entire 
row group before it can splice them into contiguous column chunks at flush, so 
peak memory is ≈ Σ(compressed bytes of all column chunks) and grows with row 
group size. On wide, skewed schemas (e.g. ~400 columns, some columns far larger 
than others) this can consume >=12 GB of memory just to write Parquet. The only 
existing lever is flushing smaller row groups via `in_progress_size`, which 
trades away compression and read-time page/row-group pruning.
   
   ### Describe the solution you'd like
   Introduce a pluggable buffering trait (`PageStore`) so completed pages can 
be spilled (e.g. to disk) instead of held in memory, with the in-memory default 
preserving today's behavior and byte-for-byte output. The writer should stream 
each page back out of the store one at a time at splice rather than 
materializing the whole chunk, bounding peak write memory by the in-flight 
encoder/dictionary buffers instead of the row group size.
   
   ### Describe alternatives you've considered
   Reducing row group size to limit buffering, but this sacrifices encoding 
efficiency and read performance and doesn't address the underlying coupling 
where one column's size forces the page layout of others.
   
   ### Additional context
   Related issues:
   - #5828
   - #5450
   - #5484
   - Fixed by PR #10020.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to