This is an automated email from the ASF dual-hosted git repository.

alamb pushed a commit to branch main
in repository https://gitbox.apache.org/repos/asf/arrow-rs.git


The following commit(s) were added to refs/heads/main by this push:
     new 7f6524def2 fix(parquet): bound schema num_children before 
Vec::with_capacity (#9884)
7f6524def2 is described below

commit 7f6524def267f5c5be73b7d5320185ea9f3bb91f
Author: masumi-ryugo <[email protected]>
AuthorDate: Thu May 7 00:02:47 2026 +0900

    fix(parquet): bound schema num_children before Vec::with_capacity (#9884)
    
    ## What
    
    Validate `SchemaElement::num_children` in
    `parquet::schema::types::schema_from_array_helper` before passing it to
    `Vec::with_capacity`. Reject:
    
    1. **Negative counts** — they wrap to a huge `usize` on the cast, and
    they're nonsensical anyway (`num_children` is a count).
    2. **Counts that exceed the remaining schema-array entries** — each
    declared child must consume at least one entry, so a value larger than
    `num_elements - index - 1` is invalid input. Without this bound the cast
    feeds `Vec::with_capacity(~i32::MAX)` and `alloc::raw_vec::handle_error`
    panics with `capacity overflow` on a 64-bit target.
    
    ```rust
    if n < 0 {
        return Err(general_err!(...));
    }
    let n_children = n as usize;
    let remaining = num_elements.saturating_sub(index + 1);
    if n_children > remaining {
        return Err(general_err!(...));
    }
    let mut fields = Vec::with_capacity(n_children);
    ```
    
    ## How it was found
    
    The `parquet/fuzz/thrift_decode` cargo-fuzz harness from #5332 /
    
[`fuzz/initial-harnesses`](https://github.com/masumi-ryugo/arrow-rs/tree/fuzz/initial-harnesses),
    running for ~30 s on `parquet/main` HEAD with no seed corpus, produces
    this 22-byte input:
    
    ```
    00000000  02 00 2b 11 01 08 00 11  11 00 00 00 00 00 00 00  
|..+.............|
    00000010  f7 f9 11 01 a5 ff                                  |......|
    ```
    
    Backtrace before this patch:
    
    ```
    thread '<unnamed>' panicked at raw_vec/mod.rs:28:5: capacity overflow
       2: alloc::raw_vec::capacity_overflow
       3: alloc::raw_vec::handle_error          raw_vec/mod.rs:889:29
       4: schema_from_array_helper              schema/types.rs:1404
       5: parquet_schema_from_array             schema/types.rs:1304:17
       6: parquet_metadata_from_bytes           metadata/thrift/mod.rs:791:31
       7: decode_metadata                       metadata/parser.rs:233:5
       8: decode_metadata                       metadata/reader.rs:823:9
    ```
    
    After this patch the same bytes return a clean `ParquetError::General`,
    no panic, no allocation spike.
    
    ## Tests
    
    Three new tests in `schema::types::tests`:
    
    - `test_parquet_schema_from_array_rejects_negative_num_children` —
    direct `parquet_schema_from_array` call with `num_children = Some(-1)`.
    - `test_parquet_schema_from_array_rejects_overflowing_num_children` —
    same with `num_children = Some(i32::MAX)`.
    - `test_decode_metadata_oom_repro_schema_overflow_does_not_panic` —
    end-to-end via `ParquetMetaDataReader::decode_metadata` using the
    22-byte repro.
    
    The existing 107 tests under `schema::` continue to pass.
    
    ## Relationship to #9883
    
    Sibling fix, not dependent. #9883 caps the up-front allocation
    **inside** the thrift parser (`read_thrift_vec`); this PR caps the
    allocation **downstream** of the thrift parser, where
    `SchemaElement::num_children` becomes a `Vec::with_capacity` argument.
    They live on different code paths — fuzz finds inputs that hit either
    one independently — and either can land first.
    
    xref #5332 #9883 #9868 #9874
    
    ---------
    
    Co-authored-by: masumi ryugo 
<[email protected]>
---
 parquet/src/schema/types.rs | 20 +++++++++++++++++++-
 1 file changed, 19 insertions(+), 1 deletion(-)

diff --git a/parquet/src/schema/types.rs b/parquet/src/schema/types.rs
index 2c63da74df..0d504e16fc 100644
--- a/parquet/src/schema/types.rs
+++ b/parquet/src/schema/types.rs
@@ -1401,7 +1401,7 @@ fn schema_from_array_helper<'a>(
         Some(n) => {
             let repetition = element.repetition_type;
 
-            let mut fields = Vec::with_capacity(n as usize);
+            let mut fields = Vec::with_capacity(usize::try_from(n)?);
             let mut next_index = index + 1;
             for _ in 0..n {
                 let child_result = schema_from_array_helper(elements, 
num_elements, next_index)?;
@@ -2517,4 +2517,22 @@ mod tests {
         let result_schema = parquet_schema_from_array(thrift_schema).unwrap();
         assert_eq!(result_schema, expected_schema);
     }
+
+    #[test]
+    fn test_parquet_schema_from_array_rejects_negative_num_children() {
+        let elements = vec![SchemaElement {
+            r#type: None,
+            type_length: None,
+            repetition_type: Some(Repetition::REQUIRED),
+            name: "schema",
+            num_children: Some(-1),
+            converted_type: None,
+            scale: None,
+            precision: None,
+            field_id: None,
+            logical_type: None,
+        }];
+        let result = parquet_schema_from_array(elements);
+        assert!(result.unwrap_err().to_string().contains("Integer overflow"));
+    }
 }

Reply via email to