etseidl commented on code in PR #9883:
URL: https://github.com/apache/arrow-rs/pull/9883#discussion_r3183967285
##########
parquet/src/parquet_thrift.rs:
##########
@@ -1106,4 +1166,63 @@ pub(crate) mod tests {
assert_eq!(header.size, 0);
assert_eq!(header.element_type, ElementType::Byte);
}
+
+ /// Regression test for the 10-byte fuzzer repro that panicked inside
+ /// `alloc::raw_vec::handle_error` with "capacity overflow" before this
+ /// fix: the list header for `SchemaElement` decodes to a multi-billion
+ /// `size`, and `Vec::with_capacity(size)` then tripped the
+ /// `size_of::<SchemaElement>() * size` isize-overflow check. After the
+ /// fix the allocation goes through `try_reserve_exact` and returns a
+ /// clean `ParquetError`.
+ #[test]
+ fn test_decode_metadata_huge_thrift_list_does_not_panic() {
+ let bytes: [u8; 10] = [0x28, 0xfc, 0xfc, 0xfc, 0xfc, 0xfc, 0xfc, 0xfc, 0xfc, 0x51];
+ let result = crate::file::metadata::ParquetMetaDataReader::decode_metadata(&bytes);
+ // The bytes are not valid Parquet metadata — we only require that
+ // `decode_metadata` returns an error rather than panicking.
+ assert!(result.is_err(), "expected error, got {result:?}");
+ }
+
+ /// A Thrift list header whose `size` varint decodes above `i32::MAX`
+ /// must be rejected at the protocol layer rather than wrapping into a
+ /// negative `i32` and being smuggled into downstream allocation code.
+ #[test]
+ fn test_read_list_begin_size_above_i32_max_returns_err() {
+ // List header: element_type=8 (Binary), 0xF=follow-up varint.
+ // Varint 80 80 80 80 08 decodes to 0x8000_0000 = i32::MAX + 1.
+ let mut data: Vec<u8> = vec![0xF8];
+ data.extend_from_slice(&[0x80, 0x80, 0x80, 0x80, 0x08]);
+ let mut prot = ThriftSliceInputProtocol::new(&data);
+ let result = prot.read_list_begin();
+ assert!(result.is_err(), "expected error, got {result:?}");
+ }
+
+ /// Regression test for a 7-byte cargo-fuzz repro found by the
+ /// `parquet/fuzz` `thrift_decode` harness on `parquet/main` after the
+ /// `try_reserve_exact` change: the fix above clamps `i32::MAX`-class
+ /// sizes, but on a 64-bit target the allocator is still asked for
+ /// ~6.6 GiB (`i32::MAX * size_of::<SchemaElement>()`) before the
+ /// `try_reserve_exact` Err path can fire, which is enough to OOM the
+ /// process under libFuzzer / ASan. Bounding by `bytes_remaining` keeps
+ /// the up-front allocation proportional to the input size, so on the
+ /// 7-byte input below the allocator is asked for at most a few bytes.
+ #[test]
+ fn test_decode_metadata_oom_repro_7byte_does_not_oom() {
Review Comment:
The added tests do not error on either Mac or Linux when added to #9868.
If they require a special allocator to trigger, they provide no extra
coverage.
##########
parquet/src/parquet_thrift.rs:
##########
@@ -686,8 +727,27 @@ where
R: ThriftCompactInputProtocol<'a>,
T: ReadThrift<'a, R>,
{
+ // `read_list_begin` rejects sizes above `i32::MAX`, but on a 64-bit
+ // target that is still ~2 GiB of `T`s — easily a multi-GB up-front
+ // allocation when `T` is a struct like `SchemaElement`. Cap the
+ // up-front capacity by the bytes still in the input: each Thrift list
+ // element occupies at least one byte on the wire, so a declared size
+ // larger than what the reader has left is always invalid input. The
+ // subsequent `try_reserve_exact` then serves as belt-and-suspenders
+ // for protocol implementations that can't compute a tight bound.
let list_ident = prot.read_list_begin()?;
- let mut res = Vec::with_capacity(list_ident.size as usize);
+ let declared = list_ident.size as usize;
+ let bound = declared.min(prot.bytes_remaining());
Review Comment:
As mentioned in
https://github.com/apache/arrow-rs/pull/9868#issuecomment-4373820729, the size
of the input stream has little relation to the size of the resultant
structures, as the former is highly compressed. Also, this code is simply
incorrect: `bound` has nothing to do with the number of elements the vector
should ultimately hold.
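For comparison, one common alternative pattern (a hypothetical sketch, not what this PR implements, and `MAX_INITIAL_CAPACITY` is an assumed tuning constant): clamp the untrusted declared size to a fixed cap for the *initial* reservation only, and let the `Vec` grow normally as elements are actually decoded. This bounds the attacker-controlled up-front allocation without tying it to the compressed input length.

```rust
// Assumed constant: the largest up-front reservation we allow an
// untrusted list header to force before any element has been decoded.
const MAX_INITIAL_CAPACITY: usize = 1024;

// Sketch (hypothetical helper): the declared size is only a capacity
// hint; decoding still reads exactly `declared` elements, so a lying
// header fails later when the input runs dry, not here.
fn bounded_capacity(declared: usize) -> usize {
    declared.min(MAX_INITIAL_CAPACITY)
}

fn main() {
    // A fuzzer-supplied size of ~2 billion is clamped to the cap...
    assert_eq!(bounded_capacity(i32::MAX as usize), 1024);
    // ...while ordinary schemas keep their exact capacity hint.
    assert_eq!(bounded_capacity(37), 37);
}
```

The trade-off is an extra reallocation or two for genuinely huge lists, in exchange for the initial allocation never exceeding the cap regardless of what the header claims.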
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]