Re: [PR] fix(parquet): bound data page byte size for large variable-width values [arrow-rs]

via GitHub Wed, 03 Jun 2026 15:03:12 -0700


adriangb commented on code in PR #9972:
URL: https://github.com/apache/arrow-rs/pull/9972#discussion_r3352196723



##########
parquet/src/column/writer/encoder.rs:
##########
@@ -247,6 +283,66 @@ impl<T: DataType> ColumnValueEncoder for 
ColumnValueEncoderImpl<T> {
         self.write_slice(&slice)
     }
 
+    fn count_values_within_byte_budget(
+        values: &[T::T],
+        offset: usize,
+        len: usize,
+        byte_budget: usize,
+    ) -> Option<usize> {
+        // Clamp so that a caller-supplied `len` that overruns the input
+        // (e.g. a level/value mismatch the encoder will reject later)
+        // returns an estimate instead of panicking here.
+        let end = (offset + len).min(values.len());
+        let start = offset.min(end);
+        let n = end - start;
+        // Fixed-size physical types have a constant per-value byte cost,
+        // so the answer is one division — no need to walk the slice.
+        let phys = <T::T as ParquetValueType>::PHYSICAL_TYPE;
+        if phys != Type::BYTE_ARRAY && phys != Type::FIXED_LEN_BYTE_ARRAY {
+            let per = std::mem::size_of::<T::T>().max(1);
+            let fits = (byte_budget / per).max(1);
+            return Some(fits.min(n));
+        }
+        // Variable-width: scan, accumulate, exit at the first value that
+        // pushes us past the budget. This both bounds skewed
+        // distributions (one outlier among small values is caught when
+        // it lands, regardless of position) and short-circuits when an
+        // early value alone exceeds the budget. The boundary value is
+        // included in the returned count so the caller's page-flush check
+        // trips on this mini-batch rather than leaving a sliver to glue
+        // onto the next page (see #9972 discussion).
+        let mut cum: usize = 0;
+        for (i, v) in values[start..end].iter().enumerate() {
+            cum = cum.saturating_add(plain_encoded_byte_size::<T>(v));
+            if cum > byte_budget {
+                return Some(i + 1);
+            }
+        }
+        Some(n)
+    }
+
+    fn count_values_within_byte_budget_gather(
+        values: &[T::T],
+        indices: &[usize],
+        byte_budget: usize,
+    ) -> Option<usize> {
+        let phys = <T::T as ParquetValueType>::PHYSICAL_TYPE;

Review Comment:
   Good call — deduped in cacbd4bf0a. Both methods now delegate to a shared 
private `count_within_budget` helper over an `Iterator<Item = Option<&T::T>>`: 
the contiguous path maps values to `Some`, the gather path maps indices through 
`values.get` (a `None` is counted but contributes no bytes, preserving the old 
defensive `continue`-but-still-advance-`i` behavior exactly).
   
   I benchmarked it because this module has a documented code-placement 
sensitivity (see the note on `plain_encoded_byte_size`, which is why the helper 
is also defined at the module end). Against a pre-change baseline on `string` / 
`string_and_binary_view` (`cargo bench --bench arrow_writer`), all cases came 
back within noise — `string/default` actually improved ~4%, the rest "no 
change"/"within noise" — so no regression. `column::writer` (100) + 
`arrow_writer_layout` (11) tests pass.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Re: [PR] fix(parquet): bound data page byte size for large variable-width values [arrow-rs]

Reply via email to