jmestwa-coder opened a new issue, #50184: URL: https://github.com/apache/arrow/issues/50184
### Describe the bug

`parquet::FormatStatValue` in `cpp/src/parquet/types.cc` does fixed-width loads on the statistics value:

- `BOOLEAN`: `memcpy` of `sizeof(bool)`
- `INT32`/`FLOAT`: 4-byte numeric load
- `INT64`/`DOUBLE`: 8-byte numeric load
- `INT96`: `memcpy` of `3 * sizeof(int32_t)`
- Float16 (`FIXED_LEN_BYTE_ARRAY` with the float16 logical type): 2-byte load

The `val` argument is the `min_value`/`max_value` taken verbatim from the file's Thrift-encoded statistics, so its length is attacker-controlled. A crafted file with a stat shorter than the column's physical type (for example a zero-byte stat on an `INT96` column) makes those loads read past the end of the source buffer. It is reachable from the printer/debug path that formats a file's column statistics.

### Component(s)

Parquet, C++

### Suggested fix

Reject any statistics value shorter than the fixed width its type requires before the load runs. Note that the declared width of a non-float16 `FIXED_LEN_BYTE_ARRAY` (decimal/string) cannot be validated from the `Type::type` enum alone without the column's `type_length`, and `FormatStatValue` is a public API whose signature can't change without breaking compatibility, so a full per-length check is a separate discussion.

Tracking PR: #50025
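
For illustration, the length check suggested above might look roughly like the sketch below. It is not the code from the tracking PR: `PhysicalType`, `RequiredStatWidth`, `StatValueIsLongEnough`, and `LoadInt32Stat` are hypothetical names, and the enum is a local stand-in for `parquet::Type::type` so the example compiles on its own (C++17 assumed).

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <optional>
#include <string_view>

// Hypothetical stand-in for parquet::Type::type; not the real Arrow enum.
enum class PhysicalType {
  BOOLEAN, INT32, INT64, INT96, FLOAT, DOUBLE, BYTE_ARRAY, FIXED_LEN_BYTE_ARRAY
};

// Minimum number of bytes the fixed-width load for `type` needs.
// Returns 0 when the enum alone does not determine a width
// (BYTE_ARRAY, or a non-float16 FIXED_LEN_BYTE_ARRAY without type_length).
constexpr std::size_t RequiredStatWidth(PhysicalType type, bool is_float16 = false) {
  switch (type) {
    case PhysicalType::BOOLEAN:
      return sizeof(bool);
    case PhysicalType::INT32:
    case PhysicalType::FLOAT:
      return 4;
    case PhysicalType::INT64:
    case PhysicalType::DOUBLE:
      return 8;
    case PhysicalType::INT96:
      return 3 * sizeof(std::int32_t);
    case PhysicalType::FIXED_LEN_BYTE_ARRAY:
      return is_float16 ? 2 : 0;
    case PhysicalType::BYTE_ARRAY:
      return 0;
  }
  return 0;
}

// True when `val` holds at least as many bytes as the load would read.
bool StatValueIsLongEnough(PhysicalType type, std::string_view val,
                           bool is_float16 = false) {
  return val.size() >= RequiredStatWidth(type, is_float16);
}

// Example of a guarded load: refuses to read instead of overrunning `val`.
std::optional<std::int32_t> LoadInt32Stat(std::string_view val) {
  if (!StatValueIsLongEnough(PhysicalType::INT32, val)) {
    return std::nullopt;  // stat is shorter than 4 bytes; do not load
  }
  std::int32_t out;
  std::memcpy(&out, val.data(), sizeof(out));
  return out;
}
```

With this shape, a zero-byte stat on an `INT96` column fails `StatValueIsLongEnough` and is rejected before any `memcpy` runs. How the real function should report the rejection (exception, placeholder string, or something else) is left open here.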
