adriangb opened a new pull request, #9972:
URL: https://github.com/apache/arrow-rs/pull/9972
## Summary

The parquet column writer checks the data page byte limit only **after** each mini-batch finishes writing, and mini-batches are sized by row count (`write_batch_size`, default 1024). For BYTE_ARRAY columns with large values — e.g. a 5 MiB image blob per row — a single 1024-row mini-batch can therefore buffer several GiB into one data page before the configured byte limit is ever consulted. Pages can exceed the limit by orders of magnitude. The writer even documents this (`parquet/src/column/writer/mod.rs`):

> We check for DataPage limits only after we have inserted the values. If a user writes a large number of values, the DataPage size can be well above the limit.

This PR makes the mini-batch size byte-budget aware:

- For each chunk, compute `bytes_per_value` from the values about to be written and pick `sub_batch_size = page_byte_limit / bytes_per_value` (clamped ≥ 1).
- For typical small values — numeric columns, short strings — `sub_batch_size` ≥ chunk size, so we stay on the existing batched fast path with zero behavior change.
- Only when individual values are large enough that a full chunk would blow the page does the sub-batch shrink — to one row per mini-batch in the limit, matching the format minimum of one record per page.

The byte-size check is skipped while parquet dictionary encoding is active: `estimated_value_bytes` returns the plain-encoded size, but a dict-encoded data page only stores small RLE indices, so the estimate would spuriously shrink pages. Dict fallback bounds dict-encoded pages independently.

For repeated/nested columns the sub-batch steps record-by-record (rep == 0 boundaries) so a record never spans data pages, matching the parquet format rule.

### Commit stack

1. `bench(parquet): add short and large string write benches` — new `short_string_non_null` (1M × 8 B) and `large_string_non_null` (1024 × 256 KiB) cases that exercise both ends of the new code path.
2. `fix(parquet): bound data page byte size for large variable-width values` — the actual fix. Adds `ParquetValueType::byte_size`, `ColumnValueEncoder::estimated_value_bytes`, `LevelDataRef::value_count`, and the byte-budget sub-batch logic.
3. `perf(parquet): O(1) estimated_value_bytes for byte arrays with contiguous indices` — fast path for the arrow byte-array encoder: when indices are contiguous (every non-null column), the total byte size is one subtraction on the `value_offsets()` buffer instead of a per-value loop.

### Regression test

`test_column_writer_caps_page_size_for_large_byte_array_values` writes 64 × 64 KiB BYTE_ARRAY values with a 16 KiB page byte limit. Before this fix that produced a single ~4 MiB page; after, it's one page per value (~64 pages, all within ~2× the value size).

### Bench results

5-run medians, criterion `arrow_writer` bench, default writer properties, on a noisy laptop (run-to-run variance ~±1.6%):

| bench | Δ vs main |
|---|---|
| `primitive/default` (i32 25% null) | −1.0% |
| `primitive_non_null/default` | −0.0% |
| `bool_non_null/default` | −1.2% |
| `string/default` | +0.6% |
| `short_string_non_null/default` (new, 1M × 8 B) | +0.2% |
| `large_string_non_null/default` (new, 1024 × 256 KiB) | +1.2% |
| `string_non_null/default` | −2.1% |
| `string_dictionary/default` | +0.4% |
| `list_primitive/default` | +0.5% |
| `list_primitive_non_null/default` | +0.1% |

All within laptop variance. An earlier draft regressed `short_string_non_null` by ~5–8% because every chunk walked 1024 entries through `ArrayAccessor::value(idx).as_ref().len()`; the third commit fixes that by going through the offsets buffer directly when applicable.

## Test plan

- [x] `cargo test -p parquet` — all 1276 tests pass.
- [x] Regression test confirmed to fail on stock `main` (single ~4 MiB page for 64 × 64 KiB values vs a 16 KiB page limit).
- [x] Sanity A/B/C bisection isolated the per-chunk cost to `estimated_value_bytes` specifically (the new value-count and plumbing cost ~0).
- [x] Bench sweep above; numbers within laptop noise.

🤖 Generated with [Claude Code](https://claude.com/claude-code)
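The byte-budget sizing described above can be sketched as a standalone function. This is a minimal illustration, not the actual arrow-rs internals: the function name, signature, and the ceiling-division choice are assumptions made for the sketch.

```rust
/// Sketch: pick a mini-batch size so that a batch of values with the
/// given average plain-encoded size stays within one page byte budget,
/// clamped to at least 1 row and at most the row-count batch size.
/// (Illustrative only; names do not match the arrow-rs implementation.)
fn sub_batch_size(
    page_byte_limit: usize,
    total_value_bytes: usize, // estimated plain-encoded size of the chunk
    num_values: usize,
    write_batch_size: usize, // the existing row-count mini-batch size
) -> usize {
    if num_values == 0 || total_value_bytes == 0 {
        // Nothing to budget against: keep the existing batched fast path.
        return write_batch_size;
    }
    // Ceiling division so bytes_per_value is never rounded down to 0.
    let bytes_per_value = (total_value_bytes + num_values - 1) / num_values;
    (page_byte_limit / bytes_per_value)
        .max(1) // format minimum: one record per page
        .min(write_batch_size) // never exceed the existing row budget
}

fn main() {
    // Small values: 1024 x 8 B strings with a 1 MiB limit stay on the
    // full 1024-row batch, i.e. zero behavior change.
    assert_eq!(sub_batch_size(1 << 20, 8 * 1024, 1024, 1024), 1024);
    // Large values: a 64 KiB value with a 16 KiB limit shrinks the
    // sub-batch to one row per mini-batch.
    assert_eq!(sub_batch_size(16 << 10, 64 << 10, 1, 1024), 1);
}
```

With typical numeric or short-string columns the computed size saturates at `write_batch_size`, which is why the fast path is untouched; only oversized values ever shrink the sub-batch.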
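The third commit's O(1) fast path can be illustrated with plain slices: for a byte-array column, value `i` occupies `offsets[i]..offsets[i + 1]`, so the total size over a contiguous index range is a single subtraction. The helper names below are hypothetical stand-ins for the encoder's use of `value_offsets()`.

```rust
/// Sketch: total byte size of values [start, end) when indices are
/// contiguous (every non-null column), in O(1).
fn contiguous_value_bytes(offsets: &[i32], start: usize, end: usize) -> usize {
    (offsets[end] - offsets[start]) as usize
}

/// Equivalent per-value loop, the slow path the earlier draft paid for
/// on every chunk (one length lookup per entry).
fn per_value_bytes(offsets: &[i32], indices: &[usize]) -> usize {
    indices
        .iter()
        .map(|&i| (offsets[i + 1] - offsets[i]) as usize)
        .sum()
}

fn main() {
    // Four values of length 3, 5, 0, and 12 bytes.
    let offsets = [0, 3, 8, 8, 20];
    // Both paths agree on the contiguous range [1, 4).
    assert_eq!(contiguous_value_bytes(&offsets, 1, 4), 17);
    assert_eq!(per_value_bytes(&offsets, &[1, 2, 3]), 17);
}
```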
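The record-boundary rule for repeated columns can also be sketched: a record starts wherever the repetition level is 0, so a sub-batch must end on such a boundary to keep a record from spanning two data pages. `next_record_end` is a hypothetical helper, not the arrow-rs API.

```rust
/// Sketch: given repetition levels, return the index one past the end of
/// the record that starts at `start` (i.e. the next rep == 0 boundary).
/// (Illustrative only; the real writer steps sub-batches this way.)
fn next_record_end(rep_levels: &[i16], start: usize) -> usize {
    // Skip the record head at `start`, then scan to the next rep == 0.
    let mut i = start + 1;
    while i < rep_levels.len() && rep_levels[i] != 0 {
        i += 1;
    }
    i
}

fn main() {
    // Three records: 3 values, 2 values, then a single-value record.
    let rep = [0i16, 1, 1, 0, 1, 0];
    assert_eq!(next_record_end(&rep, 0), 3);
    assert_eq!(next_record_end(&rep, 3), 5);
    assert_eq!(next_record_end(&rep, 5), 6);
}
```

Stepping the byte budget in whole-record increments like this is what lets the limit shrink all the way to one record per page without ever splitting a record.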
