masumi-ryugo opened a new pull request, #9884:
URL: https://github.com/apache/arrow-rs/pull/9884

   ## What
   
   Validate `SchemaElement::num_children` in 
`parquet::schema::types::schema_from_array_helper` before passing it to 
`Vec::with_capacity`. Reject:
   
   1. **Negative counts** — they wrap to a huge `usize` on the cast, and 
they're nonsensical anyway (`num_children` is a count).
   2. **Counts that exceed the remaining schema-array entries** — each declared 
child must consume at least one entry, so a value larger than `num_elements - 
index - 1` is invalid input. Without this bound the cast feeds 
`Vec::with_capacity(~i32::MAX)` and `alloc::raw_vec::handle_error` panics with 
`capacity overflow` on a 64-bit target.
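   Both failure modes stem from the unchecked `i32 -> usize` cast. A minimal standalone sketch of the wrap behavior on a 64-bit target (illustration only, not arrow-rs code):
   
   ```rust
   fn main() {
       // A negative count wraps on an unchecked i32 -> usize cast;
       // on a 64-bit target, -1i32 as usize == usize::MAX.
       let n: i32 = -1;
       let wrapped = n as usize;
       assert_eq!(wrapped, usize::MAX);
   
       // Any nonzero element size multiplied by such a capacity overflows
       // usize (and hence isize::MAX bytes), which is the condition that
       // makes `Vec::with_capacity` panic with `capacity overflow`.
       assert!(wrapped.checked_mul(8).is_none());
   }
   ```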
   
   ```rust
   // Reject negative counts before the i32 -> usize cast can wrap.
   if n < 0 {
       return Err(general_err!(...));
   }
   let n_children = n as usize;
   // Each declared child must consume at least one remaining entry.
   let remaining = num_elements.saturating_sub(index + 1);
   if n_children > remaining {
       return Err(general_err!(...));
   }
   let mut fields = Vec::with_capacity(n_children);
   ```
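   One detail worth noting: `saturating_sub` keeps the bound computation itself panic-free even if `index + 1` were ever to exceed `num_elements`. A standalone sketch with hypothetical values:
   
   ```rust
   fn main() {
       // Hypothetical malformed state: index already at the last entry.
       let num_elements: usize = 3;
       let index: usize = 3;
       // A plain `num_elements - (index + 1)` would underflow and panic in
       // debug builds; saturating_sub clamps to zero instead, so any
       // positive n_children is then correctly rejected as > remaining.
       let remaining = num_elements.saturating_sub(index + 1);
       assert_eq!(remaining, 0);
   }
   ```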
   
   ## How it was found
   
   The `parquet/fuzz/thrift_decode` cargo-fuzz harness from #5332 / 
[`fuzz/initial-harnesses`](https://github.com/masumi-ryugo/arrow-rs/tree/fuzz/initial-harnesses),
 running for ~30 s on `parquet/main` HEAD with no seed corpus, produces this 
22-byte input:
   
   ```
   00000000  02 00 2b 11 01 08 00 11  11 00 00 00 00 00 00 00  |..+.............|
   00000010  f7 f9 11 01 a5 ff                                  |......|
   ```
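   For convenience, the same 22 bytes transcribed from the hexdump above into a Rust array, in a form suitable for embedding in a test:
   
   ```rust
   fn main() {
       // The fuzz input, byte-for-byte from the hexdump above.
       const REPRO: [u8; 22] = [
           0x02, 0x00, 0x2b, 0x11, 0x01, 0x08, 0x00, 0x11,
           0x11, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
           0xf7, 0xf9, 0x11, 0x01, 0xa5, 0xff,
       ];
       assert_eq!(REPRO.len(), 22);
   }
   ```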
   
   Backtrace before this patch:
   
   ```
   thread '<unnamed>' panicked at raw_vec/mod.rs:28:5: capacity overflow
      2: alloc::raw_vec::capacity_overflow
      3: alloc::raw_vec::handle_error          raw_vec/mod.rs:889:29
      4: schema_from_array_helper              schema/types.rs:1404
      5: parquet_schema_from_array             schema/types.rs:1304:17
      6: parquet_metadata_from_bytes           metadata/thrift/mod.rs:791:31
      7: decode_metadata                       metadata/parser.rs:233:5
      8: decode_metadata                       metadata/reader.rs:823:9
   ```
   
   After this patch the same bytes return a clean `ParquetError::General`, no 
panic, no allocation spike.
   
   ## Tests
   
   Three new tests in `schema::types::tests`:
   
   - `test_parquet_schema_from_array_rejects_negative_num_children` — direct 
`parquet_schema_from_array` call with `num_children = Some(-1)`.
   - `test_parquet_schema_from_array_rejects_overflowing_num_children` — same 
with `num_children = Some(i32::MAX)`.
   - `test_decode_metadata_oom_repro_schema_overflow_does_not_panic` — 
end-to-end via `ParquetMetaDataReader::decode_metadata` using the 22-byte repro.
   
   The existing 107 tests under `schema::` continue to pass.
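   The checks the first two tests exercise can be sketched as a free function (the name and error type here are hypothetical; the actual patch inlines the checks in `schema_from_array_helper`):
   
   ```rust
   /// Hypothetical standalone version of the num_children validation.
   fn validate_num_children(n: i32, index: usize, num_elements: usize) -> Result<usize, String> {
       if n < 0 {
           return Err(format!("invalid negative num_children: {n}"));
       }
       let n_children = n as usize;
       let remaining = num_elements.saturating_sub(index + 1);
       if n_children > remaining {
           return Err(format!("num_children {n_children} exceeds remaining entries {remaining}"));
       }
       Ok(n_children)
   }
   
   fn main() {
       // Mirrors the negative-count test: -1 must be rejected.
       assert!(validate_num_children(-1, 0, 10).is_err());
       // Mirrors the overflow test: i32::MAX can never fit in the remainder.
       assert!(validate_num_children(i32::MAX, 0, 10).is_err());
       // A sane count passes and is returned as usize.
       assert_eq!(validate_num_children(2, 0, 10), Ok(2));
   }
   ```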
   
   ## Relationship to #9883
   
   Sibling fix, not dependent. #9883 caps the up-front allocation **inside** 
the thrift parser (`read_thrift_vec`); this PR caps the allocation 
**downstream** of the thrift parser, where `SchemaElement::num_children` 
becomes a `Vec::with_capacity` argument. They live on different code paths — 
fuzz finds inputs that hit either one independently — and either can land first.
   
   xref #5332 #9883 #9868 #9874
   

