adriangb opened a new pull request, #9972:
URL: https://github.com/apache/arrow-rs/pull/9972

   ## Summary
   
   The parquet column writer checks the data page byte limit only **after** 
each mini-batch finishes writing, and mini-batches are sized by row count 
(`write_batch_size`, default 1024). For BYTE_ARRAY columns with large values — 
e.g. a 5 MiB image blob per row — a single 1024-row mini-batch can therefore 
buffer several GiB into one data page before the configured byte limit is ever 
consulted. Pages can exceed the limit by orders of magnitude. The writer even 
documents this (`parquet/src/column/writer/mod.rs`):
   
   > We check for DataPage limits only after we have inserted the values. If a 
user writes a large number of values, the DataPage size can be well above the 
limit.
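
   For concreteness, a minimal self-contained sketch of that pre-fix loop shape (not the writer's actual code):

   ```rust
   // A fixed row-count mini-batch is fully buffered before the byte limit
   // is ever consulted, so one mini-batch of large values can blow past it.
   fn write_column(values: &[Vec<u8>], page_byte_limit: usize) -> Vec<Vec<u8>> {
       let write_batch_size = 1024; // mirrors the default `write_batch_size`
       let (mut pages, mut page) = (Vec::new(), Vec::new());
       for chunk in values.chunks(write_batch_size) {
           for value in chunk {
               // The whole mini-batch lands in the page buffer first:
               // 1024 rows × 5 MiB ≈ 5 GiB for the blob example above.
               page.extend_from_slice(value);
           }
           // Only now is the byte limit checked.
           if page.len() >= page_byte_limit {
               pages.push(std::mem::take(&mut page));
           }
       }
       if !page.is_empty() {
           pages.push(page);
       }
       pages
   }
   ```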
   
   This PR makes the mini-batch size byte-budget aware:
   
   - For each chunk, compute `bytes_per_value` from the values about to be written and pick `sub_batch_size = page_byte_limit / bytes_per_value`, clamped to at least 1 (see the sketch after this list).
   - For typical small values — numeric columns, short strings — 
`sub_batch_size` ≥ chunk size, so we stay on the existing batched fast path 
with zero behavior change.
   - Only when individual values are large enough that a full chunk would blow 
the page does the sub-batch shrink — to one row per mini-batch in the limit, 
matching the format minimum of one record per page.
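
   A minimal sketch of the budget computation, with hypothetical signatures rather than the PR's actual plumbing (`estimated_value_bytes` is modeled as `None` while dictionary encoding is active, per the paragraph below):

   ```rust
   /// Sketch only: names and signatures here are illustrative.
   fn sub_batch_size(
       chunk_len: usize,
       estimated_value_bytes: Option<usize>,
       value_count: usize,
       page_byte_limit: usize,
       write_batch_size: usize,
   ) -> usize {
       match estimated_value_bytes {
           Some(total_bytes) if value_count > 0 => {
               let bytes_per_value = total_bytes.div_ceil(value_count).max(1);
               // Budget rows so the page lands near the byte limit, clamped
               // to at least 1 row (the format minimum of one record per page).
               (page_byte_limit / bytes_per_value).max(1).min(chunk_len)
           }
           // No usable estimate (e.g. dictionary encoding is active):
           // keep the existing row-count batching.
           _ => write_batch_size,
       }
   }
   ```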
   
   The byte-size check is skipped while parquet dictionary encoding is active: `estimated_value_bytes` returns the plain-encoded size, but a dict-encoded data page only stores small RLE indices, so the estimate would spuriously shrink pages. The dictionary fallback bounds dict-encoded pages independently.
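
   Plugging illustrative numbers into the sketch above (1 MiB page limit, 1024-row chunks):

   ```rust
   #[test]
   fn sub_batch_size_examples() {
       // 8 B values keep the full chunk; 256 KiB values shrink to 4 rows
       // per mini-batch; an active dictionary keeps the row-count batch.
       assert_eq!(sub_batch_size(1024, Some(8 * 1024), 1024, 1 << 20, 1024), 1024);
       assert_eq!(sub_batch_size(1024, Some(256 << 20), 1024, 1 << 20, 1024), 4);
       assert_eq!(sub_batch_size(1024, None, 1024, 1 << 20, 1024), 1024);
   }
   ```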
   
   For repeated/nested columns, the sub-batch steps record by record (`rep == 0` boundaries) so a record never spans data pages, matching the Parquet format rule.
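
   A hypothetical helper showing one way to step to such a boundary (not the PR's exact code):

   ```rust
   /// A repetition level of 0 starts a new record, so moving a tentative
   /// sub-batch end forward to the next `rep == 0` index guarantees the
   /// cut never splits a record across data pages.
   fn snap_to_record_boundary(rep_levels: &[i16], target_end: usize) -> usize {
       rep_levels[target_end..]
           .iter()
           .position(|&rep| rep == 0)
           .map(|offset| target_end + offset)
           .unwrap_or(rep_levels.len())
   }

   #[test]
   fn snap_to_record_boundary_example() {
       // Three records: values 0..2, value 2 alone, and values 3..6.
       let rep_levels = [0, 1, 0, 0, 1, 1];
       // A cut targeted mid-record moves forward to the next record start;
       // a cut already on a boundary stays put.
       assert_eq!(snap_to_record_boundary(&rep_levels, 1), 2);
       assert_eq!(snap_to_record_boundary(&rep_levels, 2), 2);
   }
   ```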
   
   ### Commit stack
   
   1. `bench(parquet): add short and large string write benches` — new 
`short_string_non_null` (1M × 8 B) and `large_string_non_null` (1024 × 256 KiB) 
cases that exercise both ends of the new code path.
   2. `fix(parquet): bound data page byte size for large variable-width values` 
— the actual fix. Adds `ParquetValueType::byte_size`, 
`ColumnValueEncoder::estimated_value_bytes`, `LevelDataRef::value_count`, and 
the byte-budget sub-batch logic.
   3. `perf(parquet): O(1) estimated_value_bytes for byte arrays with contiguous indices` — fast path for the arrow byte-array encoder: when indices are contiguous (true of every non-null column), the total byte size is one subtraction on the `value_offsets()` buffer instead of a per-value loop (sketched below).
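
   A sketch of that fast path as a free function over a non-null `StringArray` (the PR implements it inside the encoder; the name here is illustrative):

   ```rust
   use arrow_array::StringArray;

   /// With contiguous indices, the plain-encoded byte size of rows
   /// `start..end` is one subtraction on the offsets buffer.
   fn contiguous_value_bytes(array: &StringArray, start: usize, end: usize) -> usize {
       let offsets = array.value_offsets(); // length is array.len() + 1
       (offsets[end] - offsets[start]) as usize
   }

   #[test]
   fn contiguous_value_bytes_example() {
       // offsets = [0, 4, 6, 10]
       let array = StringArray::from(vec!["abcd", "ef", "ghij"]);
       assert_eq!(contiguous_value_bytes(&array, 0, 3), 10); // whole array
       assert_eq!(contiguous_value_bytes(&array, 1, 2), 2); // just "ef"
   }
   ```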
   
   ### Regression test
   
   `test_column_writer_caps_page_size_for_large_byte_array_values` writes 64 × 
64 KiB BYTE_ARRAY values with a 16 KiB page byte limit. Before this fix that 
produced a single ~4 MiB page; after, it's one page per value (~64 pages, all 
within ~2× the value size).
   
   ### Bench results
   
   5-run medians, criterion `arrow_writer` bench, default writer properties, on 
a noisy laptop (run-to-run variance ~±1.6%):
   
   | bench | Δ vs main |
   |---|---|
   | `primitive/default` (i32 25% null) | −1.0% |
   | `primitive_non_null/default` | −0.0% |
   | `bool_non_null/default` | −1.2% |
   | `string/default` | +0.6% |
   | `short_string_non_null/default` (new, 1M × 8 B) | +0.2% |
   | `large_string_non_null/default` (new, 1024 × 256 KiB) | +1.2% |
   | `string_non_null/default` | −2.1% |
   | `string_dictionary/default` | +0.4% |
   | `list_primitive/default` | +0.5% |
   | `list_primitive_non_null/default` | +0.1% |
   
   All within laptop variance. An earlier draft regressed `short_string_non_null` by ~5–8% because every chunk walked 1024 entries through `ArrayAccessor::value(idx).as_ref().len()`; the third commit fixes that by going through the offsets buffer directly when applicable.
   
   ## Test plan
   
   - [x] `cargo test -p parquet` — all 1276 tests pass.
   - [x] Regression test confirmed to fail on stock `main` (single ~4 MiB page 
for 64 × 64 KiB values vs a 16 KiB page limit).
   - [x] Sanity A/B/C bisection isolated the per-chunk cost to 
`estimated_value_bytes` specifically (the new value-count and plumbing cost ~0).
   - [x] Bench sweep above; numbers within laptop noise.
   
   🤖 Generated with [Claude Code](https://claude.com/claude-code)

