[I] Enhance sanity check on Parquet metadata [arrow-rs]

via GitHub Mon, 12 Aug 2024 15:02:17 -0700


jp0317 opened a new issue, #6228:
URL: https://github.com/apache/arrow-rs/issues/6228


   **Describe the bug**
   The current code appears to lack sanity checks on Parquet metadata, which 
makes it vulnerable to corrupted files. A few examples follow:
   
   1. There are [several](https://github.com/apache/arrow-rs/blob/master/parquet/src/file/serialized_reader.rs#L445-L475)
 `i32 as u32` casts that do not check whether the `i32` is negative. These 
`u32` values may then be used to size buffer allocations (e.g., 
[here](https://github.com/apache/arrow-rs/blob/master/parquet/src/arrow/array_reader/byte_array.rs#L364))
 when reading data.
   
   2. `read_records` does not validate the [`levels_read` value returned by 
`read_rep_levels`](https://github.com/apache/arrow-rs/blob/master/parquet/src/column/reader.rs#L240-L241).
 A corrupted file may cause `read_rep_levels` to return 0, which could lead to an 
[infinite 
loop](https://github.com/apache/arrow-rs/blob/master/parquet/src/column/reader.rs#L230).
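
For the first point, a minimal sketch of a checked conversion that could replace a bare `i32 as u32` cast. The names `MetadataError` and `checked_size` are hypothetical and exist only for this illustration; they are not part of the parquet crate, which has its own error type.

```rust
use std::convert::TryFrom;
use std::fmt;

/// Hypothetical error type for this sketch (not the parquet crate's own error).
#[derive(Debug, PartialEq)]
pub struct MetadataError(String);

impl fmt::Display for MetadataError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "corrupt metadata: {}", self.0)
    }
}

impl std::error::Error for MetadataError {}

/// Converts a signed size field from a page header into `usize`.
/// A negative value yields an error instead of wrapping the way `as u32` does.
pub fn checked_size(field: &str, value: i32) -> Result<usize, MetadataError> {
    usize::try_from(value)
        .map_err(|_| MetadataError(format!("invalid {}: {}", field, value)))
}
```

With a plain cast, `-1i32 as u32` wraps to `u32::MAX`, so any allocation sized from it becomes enormous. With the checked version, `checked_size("num_values", -1)` returns an error before the value can reach an allocation.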
   
   **To Reproduce**
   
   Use the `read_parquet` example to read the two bad files from 
[here](https://github.com/jp0317/parquet-testing/commit/d6ad56b337166cc8c36eda2496812a05beca1368).
 Reading `bad-dict-page-header.parquet` may return an EOF error, but internally 
the library has already called `Vec::reserve(N)`, where `N` comes from a negative 
`i32`. Reading `bad-levels.parquet` simply gets stuck in an infinite loop.
   
   **Expected behavior**
   The reader should examine the metadata and return proper errors.
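
For the infinite-loop case, one possible shape of such an error is a progress guard in the read loop. This is only a sketch: `read_all_levels` and its `read_batch` closure are hypothetical stand-ins for the level-reading loop and do not mirror the actual `read_records` code.

```rust
/// Reads `total` levels by repeatedly calling `read_batch`, which is given the
/// number of levels still wanted and returns how many it actually produced.
/// Hypothetical helper for illustration only.
pub fn read_all_levels<F>(total: usize, mut read_batch: F) -> Result<usize, String>
where
    F: FnMut(usize) -> usize,
{
    let mut read = 0;
    while read < total {
        let remaining = total - read;
        let n = read_batch(remaining);
        // Zero progress would make this loop spin forever, and more levels
        // than requested means the decoder is misbehaving. Treat both as errors.
        if n == 0 || n > remaining {
            return Err(format!(
                "corrupt levels: got {} after reading {} of {}",
                n, read, total
            ));
        }
        read += n;
    }
    Ok(read)
}
```

A decoder that always returns 0, as a corrupted file might cause, then produces an error on the first iteration instead of hanging.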
   
   **Additional context**
   

