masumi-ryugo opened a new pull request, #9868:
URL: https://github.com/apache/arrow-rs/pull/9868

   # Which issue does this PR close?
   
   N/A — found via `cargo-fuzz` libFuzzer harness over
   `ParquetMetaDataReader::parse_and_finish` and
   `ParquetRecordBatchReaderBuilder::try_new`. Happy to file a tracking
   issue first if maintainers prefer.
   
   # Rationale for this change
   
   `parquet::parquet_thrift::read_thrift_vec` reads a Thrift compact-protocol
   list header and then calls `Vec::with_capacity(list_ident.size as usize)`
   where `list_ident.size` is a 32-bit varint pulled directly from
   attacker-controlled bytes. On a malformed input the value can be close to
   `i32::MAX`, which after the per-element-size multiplication is a
   multi-GB-to-multi-hundred-GB allocation request. That panics in
   `alloc::raw_vec::handle_error` with `capacity overflow` (or OOM-kills the
   process on smaller-but-still-huge values) before any element is decoded.
   
   Every public sync API that parses parquet bytes funnels through this
   function:
   
   - `ParquetMetaDataReader::parse_and_finish`
   - `ParquetMetaDataReader::decode_metadata`
   - `SerializedFileReader::new`
   - `ParquetRecordBatchReaderBuilder::try_new`
   - the `async_reader` family (re-exports `decode_metadata` after
     prefetching footer bytes)
   
   So any downstream code that hands attacker-controlled bytes to a parquet
   reader gets a panic-on-decode DoS. Per `SECURITY.md` this is a bug, not a
   vulnerability — there is no information disclosure or RCE path, only
   availability — but it is reachable from every metadata entry point and
   worth closing.
   
   # What changes are included in this PR?
   
   - Add a default `remaining_bytes() -> Option<usize>` method to
     `ThriftCompactInputProtocol`, returning the number of bytes still
     available when the protocol is backed by an in-memory buffer.
   - Override it on `ThriftSliceInputProtocol` to return `Some(self.buf.len())`.
   - In `read_thrift_vec`, clamp the capacity passed to `Vec::with_capacity`
     by `remaining_bytes()` (every Thrift element costs at least one wire
     byte, so a list whose declared `size` exceeds the remaining input is
     malformed by definition — the same clamp the C++ Thrift runtime
     applies). Streaming protocols that cannot report a remaining count
     fall back to a 1 MiB cap and grow on push.
   - Reject negative sizes explicitly. Although the wire format is unsigned,
     a hostile encoder can construct a varint that decodes to a negative
     `i32`.
   - Three regression tests in the existing `parquet_thrift::tests` module:
     - `test_read_thrift_vec_huge_size_does_not_panic` exercises the direct
       `read_thrift_vec` path with a varint that decodes to ~`i32::MAX`.
     - `test_decode_metadata_huge_thrift_list_does_not_panic` runs the
       10-byte fuzzer repro through `ParquetMetaDataReader::decode_metadata`
       and asserts a clean `Err`.
     - `test_read_thrift_vec_negative_size_returns_err` asserts the
       explicit negative-size guard.
   
   # Are these changes tested?
   
   Yes — three new unit tests as above. `cargo test -p parquet --release`,
   `cargo clippy -p parquet --all-targets -- -D warnings`, and
   `cargo fmt --check` are all clean.
   
   (Note: a single pre-existing `should_panic` test
   `test_read_non_utf8_binary_as_utf8` aborts during the rust test
   harness's panic-formatting machinery on `origin/main` HEAD itself, so
   it is excluded with `--skip` for the local run. Unrelated to this PR.)
   
   # Are there any user-facing changes?
   
   Yes — malformed Parquet inputs that previously triggered a `capacity
   overflow` panic during metadata decoding now return a `ParquetError`
   that callers can handle. No behavior change for well-formed files.
   
   # Reproducer
   
   10 bytes:
   
   ```
   0x28 0xfc 0xfc 0xfc 0xfc 0xfc 0xfc 0xfc 0xfc 0x51
   ```
   
   ```rust
   let bytes: &[u8] = &[0x28, 0xfc, 0xfc, 0xfc, 0xfc, 0xfc, 0xfc, 0xfc, 0xfc, 
0x51];
   let _ = 
parquet::file::metadata::ParquetMetaDataReader::decode_metadata(bytes);
   ```
   
   Before this PR (RUST_BACKTRACE=1):
   
   ```
   panicked at library/alloc/src/raw_vec/mod.rs:28:5: capacity overflow
      3: alloc::raw_vec::handle_error
      4: alloc::raw_vec::with_capacity_in
      7: alloc::vec::Vec::with_capacity::<SchemaElement>
      8: parquet::parquet_thrift::read_thrift_vec       (parquet_thrift.rs:690)
      9: parquet::file::metadata::thrift::parquet_metadata_from_bytes
                                                        (thrift/mod.rs:790)
     10: parquet::file::metadata::parser::decode_metadata (parser.rs:233)
     13: ParquetMetaDataReader::decode_footer_metadata   (reader.rs:783)
     14: ParquetMetaDataReader::parse_metadata           (reader.rs:585)
     17: ParquetMetaDataReader::parse_and_finish         (reader.rs:230)
   ```
   
   After this PR: `Err(ParquetError::EOF("Unexpected EOF"))` (or similar
   shape depending on which struct field fails first), no panic.
   
   A 45-byte reproducer that drives the full reader stack (with valid
   `PAR1` magic, so it works for `ParquetRecordBatchReaderBuilder::try_new`)
   behaves the same way.
   
   # Found via
   
   `cargo-fuzz` libFuzzer harness wrapping
   `ParquetMetaDataReader::parse_and_finish` and
   `ParquetRecordBatchReaderBuilder::try_new` over `bytes::Bytes`. ~95 unique
   crashing inputs (different VLQ sizes, different element types —
   `SchemaElement`, `RowGroup`, `ColumnChunk` — different offsets in the
   metadata graph) all converged on this single root cause within ~3 minutes
   of single-thread fuzzing. One bug, many surface symptoms.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to