masumi-ryugo opened a new pull request, #9883:
URL: https://github.com/apache/arrow-rs/pull/9883
**Builds on #9868.** Once that PR merges I'll rebase this down to a single
commit; until then the diff includes its commit as an ancestor.
## What
Cap the up-front list allocation in
`parquet::parquet_thrift::read_thrift_vec` at the bytes still available to the
input protocol. Adds a `bytes_remaining()` method to
`ThriftCompactInputProtocol` (default `usize::MAX`; overridden in
`ThriftSliceInputProtocol` to return `self.buf.len()`).
```rust
let declared = list_ident.size as usize;
let bound = declared.min(prot.bytes_remaining());
let mut res: Vec<T> = Vec::new();
if res.try_reserve_exact(bound).is_err() { … }
// Still iterate the declared count so EOF surfaces if the header
// overstated the size.
for _ in 0..list_ident.size {
    let val = T::read_thrift(prot)?;
    res.push(val);
}
```
## Why `try_reserve_exact` alone isn't enough
#9868's `try_reserve_exact` is the right defense in principle, but on
**Linux with default overcommit** the allocator returns success for the
requested size, and the OOM only surfaces under tools that observe the malloc
size (libFuzzer, ASan). In practice that means production processes will still
be OOM-killed on these inputs once the pages are actually touched.
A 7-byte cargo-fuzz repro produced by the new `parquet/fuzz/thrift_decode`
harness (#5332) demonstrates this:
```
==…== ERROR: libFuzzer: out-of-memory (malloc(6643777440))
…
#11 0x… RawVec<…SchemaElement>::try_reserve_exact alloc/src/raw_vec/mod.rs:384
#12 0x… parquet_thrift::read_thrift_vec::<SchemaElement, ThriftSliceInputProtocol> parquet_thrift.rs:713
#13 0x… parquet_metadata_from_bytes metadata/thrift/mod.rs:790
```
Input bytes: `[0x26, 0xf6, 0xf6, 0xf6, 0xd4, 0x6b, 0x6c]` — an
`i32::MAX`-class list size in the header, which drives a
`size * sizeof(SchemaElement)` allocation of ≈ 6.6 GB.
After this PR the same 7 bytes return `ParquetError::General("cannot
allocate Thrift list of … elements")` immediately, and peak RSS stays in the
single-digit MiB range.
## Tests
Two new regression tests in `parquet::parquet_thrift::tests`:
- `test_decode_metadata_oom_repro_7byte_does_not_oom` — the 7-byte
cargo-fuzz repro
- `test_decode_metadata_oom_repro_padded_does_not_oom` — a 42-byte variant
with `0x20` padding (matches the larger `arrow_reader`-harness repro)
The existing `test_decode_metadata_huge_thrift_list_does_not_panic` and
`test_read_list_begin_size_above_i32_max_returns_err` from #9868 continue to
pass.
## Out-of-scope (follow-ups)
I'll send these as separate PRs once this lands and #5332's fuzz harness is
set up to keep them honest:
1. **`parquet/src/schema/types.rs:1404`** — `Vec::with_capacity(n as usize)`
over an attacker-controlled `num_children`. Reachable downstream of thrift
decoding; produces a separate `capacity overflow` panic on the same input
class. A 22-byte fuzz repro is already in hand.
2. **`parquet/src/format.rs`** — ~22 hand-rolled
`Vec::with_capacity(list_ident.size as usize)` sites in the legacy thrift
codegen path. Identical mitigation pattern (bound by `bytes_remaining()`), but
the file is large and noisy enough to deserve its own PR.
---
xref #5332 #9868 #9869 #9874