masumi-ryugo opened a new pull request, #9884:
URL: https://github.com/apache/arrow-rs/pull/9884
## What
Validate `SchemaElement::num_children` in
`parquet::schema::types::schema_from_array_helper` before passing it to
`Vec::with_capacity`. Reject:
1. **Negative counts** — they wrap to a huge `usize` on the cast, and
they're nonsensical anyway (`num_children` is a count).
2. **Counts that exceed the remaining schema-array entries** — each declared
child must consume at least one entry, so a value larger than `num_elements -
index - 1` is invalid input. Without this bound the cast feeds
`Vec::with_capacity(~i32::MAX)` and `alloc::raw_vec::handle_error` panics with
`capacity overflow` on a 64-bit target.
```rust
if n < 0 {
return Err(general_err!(...));
}
let n_children = n as usize;
let remaining = num_elements.saturating_sub(index + 1);
if n_children > remaining {
return Err(general_err!(...));
}
let mut fields = Vec::with_capacity(n_children);
```
## How it was found
The `parquet/fuzz/thrift_decode` cargo-fuzz harness from #5332 /
[`fuzz/initial-harnesses`](https://github.com/masumi-ryugo/arrow-rs/tree/fuzz/initial-harnesses),
running for ~30 s on `parquet/main` HEAD with no seed corpus, produces this
22-byte input:
```
00000000 02 00 2b 11 01 08 00 11 11 00 00 00 00 00 00 00
|..+.............|
00000010 f7 f9 11 01 a5 ff |......|
```
Backtrace before this patch:
```
thread '<unnamed>' panicked at raw_vec/mod.rs:28:5: capacity overflow
2: alloc::raw_vec::capacity_overflow
3: alloc::raw_vec::handle_error raw_vec/mod.rs:889:29
4: schema_from_array_helper schema/types.rs:1404
5: parquet_schema_from_array schema/types.rs:1304:17
6: parquet_metadata_from_bytes metadata/thrift/mod.rs:791:31
7: decode_metadata metadata/parser.rs:233:5
8: decode_metadata metadata/reader.rs:823:9
```
After this patch the same bytes return a clean `ParquetError::General`, no
panic, no allocation spike.
## Tests
Three new tests in `schema::types::tests`:
- `test_parquet_schema_from_array_rejects_negative_num_children` — direct
`parquet_schema_from_array` call with `num_children = Some(-1)`.
- `test_parquet_schema_from_array_rejects_overflowing_num_children` — same
with `num_children = Some(i32::MAX)`.
- `test_decode_metadata_oom_repro_schema_overflow_does_not_panic` —
end-to-end via `ParquetMetaDataReader::decode_metadata` using the 22-byte repro.
The existing 107 tests under `schema::` continue to pass.
## Relationship to #9883
Sibling fix, not dependent. #9883 caps the up-front allocation **inside**
the thrift parser (`read_thrift_vec`); this PR caps the allocation
**downstream** of the thrift parser, where `SchemaElement::num_children`
becomes a `Vec::with_capacity` argument. They live on different code paths —
fuzz finds inputs that hit either one independently — and either can land first.
xref #5332 #9883 #9868 #9874
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]