alamb opened a new issue, #10071: URL: https://github.com/apache/arrow-rs/issues/10071
### Is your feature request related to a problem? `ArrowWriter` must buffer every column's *compressed* pages for an entire row group before it can splice them into contiguous column chunks at flush, so peak memory is ≈ Σ(compressed bytes of all column chunks) and grows with row group size. On wide, skewed schemas (e.g. ~400 columns, some columns far larger than others) this can consume >=12 GB of memory just to write Parquet. The only existing lever is flushing smaller row groups via `in_progress_size`, which trades away compression and read-time page/row-group pruning. ### Describe the solution you'd like Introduce a pluggable buffering trait (`PageStore`) so completed pages can be spilled (e.g. to disk) instead of held in memory, with the in-memory default preserving today's behavior and byte-for-byte output. The writer should stream each page back out of the store one at a time at splice rather than materializing the whole chunk, bounding peak write memory by the in-flight encoder/dictionary buffers instead of the row group size. ### Describe alternatives you've considered Reducing row group size to limit buffering, but this sacrifices encoding efficiency and read performance and doesn't address the underlying coupling where one column's size forces the page layout of others. ### Additional context Related issues: - #5828 - #5450 - #5484 - Fixed by PR #10020. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
