alamb opened a new issue, #10061:
URL: https://github.com/apache/arrow-rs/issues/10061

   **Describe the bug**
   
   The parquet writer only checks the data page size limit *after* writing a 
full mini-batch of values (`write_batch_size`, default 1024). When individual 
values are large (e.g. a 5 MB string), a single mini-batch can blow far past 
the configured page size limit before the check fires — we've seen pages up to 
2 GB at default settings. The writer even documents this caveat in 
`parquet/src/column/writer/mod.rs`:
   
   > We check for DataPage limits only after we have inserted the values. If a 
user writes a large number of values, the DataPage size can be well above the 
limit.
   
   **To Reproduce**
   
   Write a BYTE_ARRAY column of large values (e.g. 64 × 64 KiB) with a small 
page byte limit (e.g. 16 KiB). Instead of roughly one page per value, the 
writer emits a single ~4 MiB page.
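
   The arithmetic behind this reproduction can be sketched with a small standalone model. This is not the arrow-rs implementation: the function name `pages_checked_per_mini_batch` and its parameters are hypothetical, and the model only assumes what the description above states, namely that the size check runs once per mini-batch, after all of its values have been buffered.

```rust
/// Simplified model of the behaviour described in this issue (hypothetical
/// helper, not arrow-rs code). Returns the byte size of each emitted page.
fn pages_checked_per_mini_batch(
    value_sizes: &[usize],
    write_batch_size: usize,
    page_byte_limit: usize,
) -> Vec<usize> {
    let mut pages = Vec::new();
    let mut current = 0usize;
    // `max(1)` guards against a zero chunk size, which `chunks` rejects.
    for mini_batch in value_sizes.chunks(write_batch_size.max(1)) {
        // Every value of the mini-batch is buffered first...
        current += mini_batch.iter().sum::<usize>();
        // ...and only then is the page size compared against the limit.
        if current >= page_byte_limit {
            pages.push(current);
            current = 0;
        }
    }
    // Flush whatever is left as a final page.
    if current > 0 {
        pages.push(current);
    }
    pages
}

fn main() {
    // 64 values of 64 KiB each, 16 KiB page limit, mini-batch of 1024 values.
    let values = vec![64 * 1024; 64];
    let pages = pages_checked_per_mini_batch(&values, 1024, 16 * 1024);
    // prints "1 page(s), largest 4194304 bytes"
    println!(
        "{} page(s), largest {} bytes",
        pages.len(),
        pages.iter().max().unwrap()
    );
}
```

   In this model, passing a `write_batch_size` of 1 yields 64 pages of 64 KiB each, which matches the workaround discussed under "Additional context": the page count improves, but each page is still four times the configured limit because a single value cannot be split.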
   
   **Expected behavior**
   
   Data (and dictionary) page sizes should respect the configured byte limit 
regardless of individual value size, without requiring users to set 
`write_batch_size` so small that it cripples write throughput for other 
columns. In the limit, this means one record per page (the parquet format 
minimum).
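
   One way to picture the expected behavior is a per-value check that closes the open page before a value would push it past the limit. The sketch below is a standalone model under stated assumptions, not a proposed patch: `pages_checked_per_value` is a hypothetical name, and each value is treated as one record (a flat, non-nested column).

```rust
/// Simplified model of the expected behaviour (hypothetical helper, not
/// arrow-rs code). Assumes one value per record. Returns page sizes in bytes.
fn pages_checked_per_value(value_sizes: &[usize], page_byte_limit: usize) -> Vec<usize> {
    let mut pages = Vec::new();
    let mut current = 0usize;
    for &size in value_sizes {
        // Close the open page first if this value would push it past the
        // limit. An empty page always accepts the value, so a single
        // oversized record ends up alone on its own page.
        if current > 0 && current + size > page_byte_limit {
            pages.push(current);
            current = 0;
        }
        current += size;
    }
    // Flush whatever is left as a final page.
    if current > 0 {
        pages.push(current);
    }
    pages
}

fn main() {
    // Same input as the reproduction: 64 values of 64 KiB, 16 KiB limit.
    let large = vec![64 * 1024; 64];
    // prints "64"
    println!("{}", pages_checked_per_value(&large, 16 * 1024).len());

    // Small values share a page; the 5 MB value is isolated.
    let mixed = [100, 200, 5_000_000, 300];
    // prints "[300, 5000000, 300]"
    println!("{:?}", pages_checked_per_value(&mixed, 1024));
}
```

   With this policy a page exceeds the limit only when a single record is itself larger than the limit, which is the one-record-per-page floor described above.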
   
   **Additional context**
   
   The only existing workaround — shrinking `write_batch_size` — is a global 
knob that hurts performance for fixed-width and small-value columns while still 
being unable to bound a single oversized row.
   
   Related: #8263 (page size exceeding `i32::MAX` when writing huge blobs) is 
the extreme manifestation of the same unbounded-page-size root cause.
   

