youichi-uda commented on PR #9868:
URL: https://github.com/apache/arrow-rs/pull/9868#issuecomment-4365657715
Heads-up from the cargo-fuzz harness I'm prototyping for #5332 — this PR's
`try_reserve_exact` change reduces but does not fully close the OOM hole on
Linux. Concretely:
**Repro** (50 bytes, fits in a single libFuzzer input):
```
00000000  15 20 19 f3 ff ff ff 20  20 20 20 20 20 20 20 20  |. .....         |
00000010  20 20 20 20 20 20 20 20  20 20 20 20 20 20 20 20  |                |
00000020  20 20 20 20 20 20 20 20  20 20 2a 00 00 00 50 41  |          *...PA|
00000030  52 31                                             |R1|
```
This is `[footer thrift bytes] + [footer_len=42 LE u32] + "PAR1"`. The
thrift bytes encode a list field whose VLQ-encoded size is several billion (`f3
ff ff ff 20 ...`), targeting `FileMetaData.schema: list<SchemaElement>`.
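For reference, the input can be reassembled from the hexdump above in a few lines of Rust (byte values copied verbatim; `build_repro` is just a name for this sketch):

```rust
// Reassemble the 50-byte repro: 42 bytes of thrift (field header + oversized
// list size, padded with 0x20), then footer_len = 42 as LE u32, then "PAR1".
fn build_repro() -> Vec<u8> {
    let mut out = vec![0x15, 0x20, 0x19, 0xf3, 0xff, 0xff, 0xff];
    out.resize(42, 0x20);                        // pad thrift bytes to 42
    out.extend_from_slice(&42u32.to_le_bytes()); // footer_len = 42, LE
    out.extend_from_slice(b"PAR1");              // parquet magic
    out
}
```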
**Result against this PR's `read_thrift_vec`** (built on `pr-9868` HEAD
`c36a092`, ASan + libfuzzer):
```
==…== ERROR: libFuzzer: out-of-memory (malloc(6643777440))
…
    #11 0x… <alloc::raw_vec::RawVec<…SchemaElement>>::try_reserve_exact alloc/src/raw_vec/mod.rs:384
    #12 0x… parquet::parquet_thrift::read_thrift_vec::<…SchemaElement, ThriftSliceInputProtocol> parquet/src/parquet_thrift.rs:713
    #13 0x… parquet::file::metadata::thrift::parquet_metadata_from_bytes parquet/src/file/metadata/thrift/mod.rs:790
    #14 0x… parquet::file::metadata::parser::decode_metadata parquet/src/file/metadata/parser.rs:233
    #…  …   decode_footer_metadata / try_parse / parse_and_finish
```
So the new `try_reserve_exact` path is engaged, but on Linux with default
overcommit, `try_reserve_exact(N)` still issues the underlying `malloc(N *
sizeof(T))` before it can return. ASan's allocator sees the requested size and
signals libFuzzer OOM before any `Err(TryReserveError)` can be observed. The
`i32::MAX` clamp in `read_list_begin` keeps things off the worst cliff, but
`i32::MAX * sizeof(SchemaElement)` still comes to the ~6.6 GB requested above.
**Suggested follow-on** (separate PR or growing this one; I'm happy either
way): bound `len` by the number of input bytes still available to the protocol
before calling `try_reserve_exact`, mirroring the chunked-read fix in #9869 for
`arrow-ipc`. Each thrift element occupies at least 1 byte on the wire, so
`min(declared_len, bytes_remaining)` is a sound upper bound on the element
count. `ThriftSliceInputProtocol` already has `as_slice() -> &[u8]`; promoting
a `bytes_remaining(&self) -> usize` method to the `ThriftCompactInputProtocol`
trait would let `read_thrift_vec` bound the up-front allocation correctly:
```rust
let declared = list_ident.size as usize;
let bound = declared.min(prot.bytes_remaining());
let mut res: Vec<T> = Vec::new();
if res.try_reserve_exact(bound).is_err() {
    return Err(general_err!("cannot allocate Thrift list of {declared} elements"));
}
```
That keeps the early-abort intent of this PR but stops the rogue
`i32::MAX`-class size from ever reaching the allocator. The legacy hand-rolled
`Vec::with_capacity(list_ident.size as usize)` sites in `parquet/src/format.rs`
(~22 of them, e.g. lines 5831, 5845 for `Vec<SchemaElement>` / `Vec<RowGroup>`)
would need the same treatment if/when they're still on the deserialization path.
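For concreteness, the trait plumbing could look roughly like the sketch below. The real `ThriftCompactInputProtocol` has many more methods and the `buf` field is a stand-in for whatever unread-suffix state the real protocol keeps, so treat this as a shape, not a patch:

```rust
// Hypothetical sketch of the suggested trait method.
trait ThriftCompactInputProtocol {
    /// Number of input bytes not yet consumed; since every thrift element
    /// takes at least one wire byte, this bounds any remaining list length.
    fn bytes_remaining(&self) -> usize;
}

struct ThriftSliceInputProtocol<'a> {
    buf: &'a [u8],
}

impl ThriftCompactInputProtocol for ThriftSliceInputProtocol<'_> {
    fn bytes_remaining(&self) -> usize {
        self.buf.len() // slice-backed: the unread suffix is the slice itself
    }
}
```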
Happy to send the trait-method + `read_thrift_vec` patch as a follow-up PR
once #5332's harness lands so we have a CI-side regression test for it.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]