youichi-uda commented on PR #9868:
URL: https://github.com/apache/arrow-rs/pull/9868#issuecomment-4365657715

   Heads-up from the cargo-fuzz harness I'm prototyping for #5332 — this PR's 
`try_reserve_exact` change reduces but does not fully close the OOM hole on 
Linux. Concretely:
   
   **Repro** (50 bytes, fits in a single libFuzzer input):
   ```
   00000000  15 20 19 f3 ff ff ff 20  20 20 20 20 20 20 20 20  |. ......        |
   00000010  20 20 20 20 20 20 20 20  20 20 20 20 20 20 20 20  |                |
   00000020  20 20 20 20 20 20 20 20  20 20 2a 00 00 00 50 41  |          *...PA|
   00000030  52 31                                              |R1|
   ```
   
   This is `[footer thrift bytes] + [footer_len=42 LE u32] + "PAR1"`. The 
thrift bytes encode a list field whose VLQ-encoded size is several billion (`f3 
ff ff ff 20 ...`), targeting `FileMetaData.schema: list<SchemaElement>`.
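
   For convenience, the input can be regenerated programmatically; a minimal sketch (the helper name is mine, the 42 footer bytes are copied verbatim from the dump above):

   ```rust
   // Reassemble the 50-byte repro: [42 footer thrift bytes] + [footer_len LE u32] + b"PAR1".
   fn build_repro() -> Vec<u8> {
       let footer: [u8; 42] = [
           0x15, 0x20, 0x19, 0xf3, 0xff, 0xff, 0xff, 0x20, // f3 ff ff ff 20 is the oversized VLQ list size
           0x20, 0x20, 0x20, 0x20, 0x20, 0x20, 0x20, 0x20,
           0x20, 0x20, 0x20, 0x20, 0x20, 0x20, 0x20, 0x20,
           0x20, 0x20, 0x20, 0x20, 0x20, 0x20, 0x20, 0x20,
           0x20, 0x20, 0x20, 0x20, 0x20, 0x20, 0x20, 0x20,
           0x20, 0x20,
       ];
       let mut input = Vec::with_capacity(50);
       input.extend_from_slice(&footer);
       input.extend_from_slice(&(footer.len() as u32).to_le_bytes()); // footer_len = 42 (2a 00 00 00)
       input.extend_from_slice(b"PAR1");
       input
   }
   ```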
   
   **Result against this PR's `read_thrift_vec`** (built on `pr-9868` HEAD 
`c36a092`, ASan + libFuzzer):
   ```
   ==…== ERROR: libFuzzer: out-of-memory (malloc(6643777440))
   …
   #11 0x…  <alloc::raw_vec::RawVec<…SchemaElement>>::try_reserve_exact
                 alloc/src/raw_vec/mod.rs:384
   #12 0x…  parquet::parquet_thrift::read_thrift_vec::<…SchemaElement, ThriftSliceInputProtocol>
                 parquet/src/parquet_thrift.rs:713
   #13 0x…  parquet::file::metadata::thrift::parquet_metadata_from_bytes
                 parquet/src/file/metadata/thrift/mod.rs:790
   #14 0x…  parquet::file::metadata::parser::decode_metadata
                 parquet/src/file/metadata/parser.rs:233
   #…       … decode_footer_metadata / try_parse / parse_and_finish
   ```
   
   So the new `try_reserve_exact` path is engaged, but on Linux with default overcommit `try_reserve_exact(N)` still hands the full `malloc(N * sizeof(T))` request to the allocator before it can return an error. ASan's allocator sees the requested size and signals the libFuzzer OOM before any `Err(TryReserveError)` can be observed. The `i32::MAX` clamp in `read_list_begin` keeps things off the worst cliff, but the clamped length times `sizeof(SchemaElement)` still comes out to the ~6.6 GB `malloc` shown above.
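
   To make that concrete outside the harness, here is a tiny standalone check (my own sketch; the element count is picked so that 830_472_180 * 8 bytes equals the 6643777440-byte `malloc` above):

   ```rust
   // `try_reserve_exact` performs the allocation up front and only returns Err if the
   // allocator itself refuses the request. On glibc Linux with default overcommit the
   // ~6.6 GB reservation usually succeeds (virtual memory only); under ASan/libFuzzer
   // the intercepted allocation trips the OOM report before any Err is returned.
   fn main() {
       let mut v: Vec<u64> = Vec::new();
       match v.try_reserve_exact(830_472_180) {
           Ok(()) => println!("allocator granted the reservation; no TryReserveError to observe"),
           Err(e) => println!("reservation refused: {e}"),
       }
   }
   ```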
   
   **Suggested follow-on (separate PR or grow this one — I'm happy either 
way)**: bound `len` by the input bytes still available to the protocol before 
`try_reserve_exact`, mirroring the chunked-read fix in #9869 for `arrow-ipc`. 
Each thrift element occupies at least 1 byte on the wire, so `min(declared_len, 
bytes_remaining)` is a sound upper bound. `ThriftSliceInputProtocol` already 
has `as_slice() -> &[u8]`; promoting a `bytes_remaining(&self) -> usize` method to the `ThriftCompactInputProtocol` trait would let `read_thrift_vec` bound the up-front allocation correctly:
   
   ```rust
   let declared = list_ident.size as usize;
   // every element occupies at least one byte on the wire, so the unread input
   // length is a hard upper bound on how many elements can actually follow
   let bound = declared.min(prot.bytes_remaining());
   let mut res: Vec<T> = Vec::new();
   if res.try_reserve_exact(bound).is_err() {
       return Err(general_err!("cannot allocate Thrift list of {declared} elements"));
   }
   ```
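
   The protocol-side hook could be as small as this; a self-contained sketch where the struct body is only a stand-in for the real `ThriftSliceInputProtocol` in `parquet/src/parquet_thrift.rs`:

   ```rust
   /// Sketch of the proposed trait method, not the real trait definition.
   pub trait ThriftCompactInputProtocol {
       /// Number of input bytes not yet consumed; an upper bound on how many
       /// list elements can still follow.
       fn bytes_remaining(&self) -> usize;
   }

   /// Stand-in for the slice-backed reader used on the metadata path.
   pub struct ThriftSliceInputProtocol<'a> {
       buf: &'a [u8],
   }

   impl<'a> ThriftSliceInputProtocol<'a> {
       pub fn as_slice(&self) -> &[u8] {
           self.buf
       }
   }

   impl ThriftCompactInputProtocol for ThriftSliceInputProtocol<'_> {
       fn bytes_remaining(&self) -> usize {
           // everything the reader has not handed out yet
           self.as_slice().len()
       }
   }
   ```

   Non-slice readers, if any end up implementing the trait, could return whatever conservative remaining-byte figure they can compute.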
   
   That keeps the early-abort intent of this PR but stops the rogue 
`i32::MAX`-class size from ever reaching the allocator. The legacy hand-rolled 
`Vec::with_capacity(list_ident.size as usize)` sites in `parquet/src/format.rs` 
(~22 of them, e.g. lines 5831, 5845 for `Vec<SchemaElement>` / `Vec<RowGroup>`) 
would need the same treatment if/when they're still on the deserialization path.
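
   For those, a rough sketch of what the same treatment could look like (the helper is hypothetical and assumes those sites can see a remaining-bytes figure):

   ```rust
   /// Clamp a declared Thrift list size to what the remaining input could possibly
   /// hold, since every element costs at least one byte on the wire. Also avoids the
   /// `size as usize` wrap-around when the declared size is negative.
   fn bounded_list_capacity(declared_size: i32, bytes_remaining: usize) -> usize {
       (declared_size.max(0) as usize).min(bytes_remaining)
   }
   ```

   i.e. `Vec::with_capacity(bounded_list_capacity(list_ident.size, bytes_remaining))` in place of `Vec::with_capacity(list_ident.size as usize)`.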
   
   Happy to send the trait-method + `read_thrift_vec` patch as a follow-up PR 
once #5332's harness lands so we have a CI-side regression test for it.
   

