alamb opened a new issue, #10061: URL: https://github.com/apache/arrow-rs/issues/10061
**Describe the bug**

The parquet writer only checks the data page size limit *after* writing a full mini-batch of values (`write_batch_size`, default 1024). When individual values are large (e.g. a 5 MB string), a single mini-batch can blow far past the configured page size limit before the check fires. We've seen pages up to 2 GB at default settings.

The writer even documents this caveat in `parquet/src/column/writer/mod.rs`:

> We check for DataPage limits only after we have inserted the values. If a user writes a large number of values, the DataPage size can be well above the limit.

**To Reproduce**

Write a BYTE_ARRAY column of large values (e.g. 64 × 64 KiB) with a small page byte limit (e.g. 16 KiB). Instead of roughly one page per value, the writer emits a single ~4 MiB page.

**Expected behavior**

Data (and dictionary) page sizes should respect the configured byte limit regardless of individual value size, without requiring users to set `write_batch_size` so small that it cripples write throughput for other columns. In the limit, this means one record per page (the parquet format minimum).

**Additional context**

The only existing workaround, shrinking `write_batch_size`, is a global knob that hurts performance for fixed-width and small-value columns while still being unable to bound a single oversized row.

Related: #8263 (page size exceeding `i32::MAX` when writing huge blobs) is the extreme manifestation of the same unbounded-page-size root cause.
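
To make the arithmetic in the reproduction concrete, here is a minimal standalone sketch of the check-after-mini-batch behaviour. It is a simplified model, not the code in `parquet/src/column/writer/mod.rs`: the function `page_sizes` and its parameter `check_every` are hypothetical names introduced for illustration, and the model ignores encoding, compression, and level data.

```rust
/// Simplified model (not the real writer): values are buffered into the
/// current page, and the byte limit is only consulted after every
/// `check_every` values, mirroring a check that runs once per mini-batch.
/// Returns the size in bytes of each emitted page.
fn page_sizes(value_sizes: &[usize], page_limit: usize, check_every: usize) -> Vec<usize> {
    // `chunks` panics on a chunk size of zero, so clamp to at least one value.
    let check_every = check_every.max(1);
    let mut pages = Vec::new();
    let mut current = 0usize;

    for batch in value_sizes.chunks(check_every) {
        // The whole mini-batch is buffered before the limit is looked at.
        current += batch.iter().sum::<usize>();
        if current >= page_limit {
            pages.push(current);
            current = 0;
        }
    }

    // Flush whatever remains as a final page.
    if current > 0 {
        pages.push(current);
    }
    pages
}
```

Under this model, `page_sizes(&vec![64 * 1024; 64], 16 * 1024, 1024)` yields a single page of 4 MiB, matching the reproduction, while the same call with `check_every = 1` yields 64 pages of 64 KiB each. Even in the per-value case every page still exceeds the 16 KiB limit, which is the "one record per page" floor described under expected behavior.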
