jp0317 opened a new issue, #6228: URL: https://github.com/apache/arrow-rs/issues/6228
**Describe the bug**
The current code appears to lack sanity checks on metadata, making it vulnerable to corrupted files. A few examples:

1. There are [several](https://github.com/apache/arrow-rs/blob/master/parquet/src/file/serialized_reader.rs#L445-L475) `i32 as u32` casts that do not check whether the `i32` is negative. These `u32` values may then be used to size buffer allocations (e.g., [here](https://github.com/apache/arrow-rs/blob/master/parquet/src/arrow/array_reader/byte_array.rs#L364)) when reading data.
2. `read_records` does not validate the [`levels_read` returned by `read_rep_levels`](https://github.com/apache/arrow-rs/blob/master/parquet/src/column/reader.rs#L240-L241). A corrupted file may cause `read_rep_levels` to return 0, which can lead to an [infinite loop](https://github.com/apache/arrow-rs/blob/master/parquet/src/column/reader.rs#L230).

**To Reproduce**
Use the `read_parquet` example to read the two bad files from [here](https://github.com/jp0317/parquet-testing/commit/d6ad56b337166cc8c36eda2496812a05beca1368). Reading `bad-dict-page-header.parquet` may return an EOF error, but internally the library has already called `Vec::reserve(N)`, where `N` comes from a negative `i32` and is therefore a very large value after the cast. Reading `bad-levels.parquet` simply gets stuck in an infinite loop.

**Expected behavior**
Validate the metadata and return proper errors.

**Additional context**
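The two failure modes described above could be guarded against along the following lines. This is only a minimal sketch, not the crate's actual code: `MetadataError`, `checked_len`, and `read_all_levels` are hypothetical names introduced for illustration, and the closure stands in for a level reader such as `read_rep_levels`.

```rust
use std::convert::TryFrom;

/// Hypothetical error type for this sketch (the real crate has its own error type).
#[derive(Debug, PartialEq)]
enum MetadataError {
    /// A length field decoded from the file was negative.
    NegativeLength(i32),
    /// The level reader produced nothing while levels were still expected.
    NoProgress,
}

/// Convert a length decoded as `i32` into `usize`, rejecting negative values
/// instead of letting an `as u32` cast wrap them into a huge allocation size.
fn checked_len(raw: i32) -> Result<usize, MetadataError> {
    usize::try_from(raw).map_err(|_| MetadataError::NegativeLength(raw))
}

/// Read `remaining` levels through `read_levels`, a stand-in for a level
/// reader that returns how many levels it produced for the requested count.
/// A return value of 0 while levels are still expected is reported as an
/// error rather than retried forever.
fn read_all_levels<F>(mut remaining: usize, mut read_levels: F) -> Result<usize, MetadataError>
where
    F: FnMut(usize) -> usize,
{
    let mut total = 0;
    while remaining > 0 {
        let n = read_levels(remaining);
        if n == 0 {
            return Err(MetadataError::NoProgress);
        }
        // Never trust the reader to stay within the requested count.
        let n = n.min(remaining);
        total += n;
        remaining -= n;
    }
    Ok(total)
}
```

With these helpers, `checked_len(-1)` yields an error before any `Vec::reserve` call can be reached, and `read_all_levels(10, |_| 0)` returns `MetadataError::NoProgress` instead of looping.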
