masumi-ryugo opened a new pull request, #9883:
URL: https://github.com/apache/arrow-rs/pull/9883
**Builds on #9868.** Once that PR merges I'll rebase this down to a single
commit; until then the diff includes its commit as an ancestor.
## What
Cap the up-front list allocation in
`parquet::parquet_thrift::read_thrift_vec` at the bytes still available to the
input protocol. Adds a `bytes_remaining()` method to
`ThriftCompactInputProtocol` (default `usize::MAX`; overridden in
`ThriftSliceInputProtocol` to return `self.buf.len()`).
```rust
let declared = list_ident.size as usize;
let bound = declared.min(prot.bytes_remaining());
let mut res: Vec<T> = Vec::new();
if res.try_reserve_exact(bound).is_err() { … }
// Still iterate the declared count so EOF surfaces if the header
// overstated the size.
for _ in 0..list_ident.size {
    let val = T::read_thrift(prot)?;
    res.push(val);
}
```
## Why `try_reserve_exact` alone isn't enough
#9868's `try_reserve_exact` is the right defense in principle, but on
**Linux with default overcommit** the allocator returns success for the
requested size, and the OOM only surfaces under tools that observe the malloc
size (libFuzzer, ASan). In practice that means production processes will still
be OOM-killed on these inputs once the pages are actually touched.
A 7-byte cargo-fuzz repro produced by the new `parquet/fuzz/thrift_decode`
harness (#5332) demonstrates this:
```
==…== ERROR: libFuzzer: out-of-memory (malloc(6643777440))
…
#11 0x… RawVec<…SchemaElement>::try_reserve_exact alloc/src/raw_vec/mod.rs:384
#12 0x… parquet_thrift::read_thrift_vec::<SchemaElement, ThriftSliceInputProtocol> parquet_thrift.rs:713
#13 0x… parquet_metadata_from_bytes metadata/thrift/mod.rs:790
```
Input bytes: `[0x26, 0xf6, 0xf6, 0xf6, 0xd4, 0x6b, 0x6c]` — an
`i32::MAX`-class list size in the header, which drives a
`size * sizeof(SchemaElement)` allocation of ≈ 6.6 GB.
After this PR the same 7 bytes return `ParquetError::General("cannot
allocate Thrift list of … elements")` immediately, and peak RSS stays in the
single-digit MiB range.
## Tests
Two new regression tests in `parquet::parquet_thrift::tests`:
- `test_decode_metadata_oom_repro_7byte_does_not_oom` — the 7-byte
cargo-fuzz repro
- `test_decode_metadata_oom_repro_padded_does_not_oom` — a 42-byte variant
with `0x20` padding (matches the larger `arrow_reader`-harness repro)
The existing `test_decode_metadata_huge_thrift_list_does_not_panic` and
`test_read_list_begin_size_above_i32_max_returns_err` from #9868 continue to
pass.
## Out-of-scope (follow-ups)
I'll send these as separate PRs once this lands and #5332's fuzz harness is
set up to keep them honest:
1. **`parquet/src/schema/types.rs:1404`** — `Vec::with_capacity(n as usize)`
over an attacker-controlled `num_children`. Reachable downstream of thrift
decoding; produces a separate `capacity overflow` panic on the same input
class. A 22-byte fuzz repro is already in hand.
2. **`parquet/src/format.rs`** — ~22 hand-rolled
`Vec::with_capacity(list_ident.size as usize)` sites in the legacy thrift
codegen path. Identical mitigation pattern (bound by `bytes_remaining()`), but
the file is large and noisy enough to deserve its own PR.
---
xref #5332 #9868 #9869 #9874