jmestwa-coder opened a new issue, #50184:
URL: https://github.com/apache/arrow/issues/50184

   ### Describe the bug
   
   `parquet::FormatStatValue` in `cpp/src/parquet/types.cc` does fixed-width 
loads on the statistics value:
   
   - `BOOLEAN`: `memcpy` of `sizeof(bool)`
   - `INT32`/`FLOAT`: 4-byte numeric load
   - `INT64`/`DOUBLE`: 8-byte numeric load
   - `INT96`: `memcpy` of `3 * sizeof(int32_t)`
   - Float16 (`FIXED_LEN_BYTE_ARRAY` with the float16 logical type): 2-byte load
   
   The `val` argument is the `min_value`/`max_value` taken verbatim from the 
file's Thrift-encoded statistics, so its length is attacker-controlled. A 
crafted file whose stat is shorter than the column's physical type (for example, 
a zero-byte stat on an `INT96` column) makes those loads read past the end of 
the source buffer.
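
A minimal sketch of what a length-checked load could look like, assuming the stat bytes are available as a `std::string_view`. `LoadChecked` is a hypothetical helper written for illustration, not an existing Arrow function, and it is not necessarily how the tracking PR addresses the problem:

```cpp
#include <cstdint>
#include <cstring>
#include <optional>
#include <string_view>
#include <type_traits>

// Hypothetical helper (not part of Arrow): copy a T out of `val` only when
// `val` holds at least sizeof(T) bytes.
template <typename T>
std::optional<T> LoadChecked(std::string_view val) {
  static_assert(std::is_trivially_copyable_v<T>, "T must be memcpy-safe");
  if (val.size() < sizeof(T)) {
    return std::nullopt;  // truncated stat: refuse instead of over-reading
  }
  T out;
  std::memcpy(&out, val.data(), sizeof(T));
  return out;
}
```

With a helper of this shape, a zero-byte stat yields an empty optional that the caller can report as invalid, rather than triggering a read past the buffer.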
   
   The over-read is reachable from the printer/debug path that formats a file's 
column statistics.
   
   ### Component(s)
   
   Parquet, C++
   
   ### Suggested fix
   
   Before the load runs, reject any statistics value shorter than the fixed 
width its type requires. Note that the declared width of a non-float16 
`FIXED_LEN_BYTE_ARRAY` (decimal/string) cannot be validated from the 
`Type::type` enum alone; that requires the column's `type_length`. Because 
`FormatStatValue` is a public API whose signature can't change without breaking 
compatibility, a full per-length check is a separate discussion.
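
The check described above could be sketched roughly as follows. The `PhysicalType` enum, `MinStatWidth`, and `StatLengthOk` are stand-ins invented for this example and are not Arrow's actual types or functions; the widths mirror the loads listed in the bug description, and a width of 0 marks the cases that cannot be validated without `type_length`:

```cpp
#include <cstddef>
#include <cstdint>
#include <string_view>

// Stand-in for the physical types named in the issue; not Arrow's enum.
enum class PhysicalType {
  kBoolean, kInt32, kInt64, kInt96, kFloat, kDouble,
  kByteArray, kFixedLenByteArray
};

// Minimum number of bytes the formatter would load for this type.
// 0 means no fixed width can be derived from the type alone.
constexpr std::size_t MinStatWidth(PhysicalType type, bool is_float16) {
  switch (type) {
    case PhysicalType::kBoolean: return sizeof(bool);
    case PhysicalType::kInt32:
    case PhysicalType::kFloat: return 4;
    case PhysicalType::kInt64:
    case PhysicalType::kDouble: return 8;
    case PhysicalType::kInt96: return 3 * sizeof(std::int32_t);
    case PhysicalType::kFixedLenByteArray: return is_float16 ? 2 : 0;
    case PhysicalType::kByteArray: return 0;
  }
  return 0;
}

// True when `val` is long enough for the fixed-width load to stay in bounds.
constexpr bool StatLengthOk(PhysicalType type, bool is_float16,
                            std::string_view val) {
  return val.size() >= MinStatWidth(type, is_float16);
}
```

Under these assumptions, the zero-byte `INT96` stat from the example is rejected, while a non-float16 `FIXED_LEN_BYTE_ARRAY` stat of any length passes, which is the gap left open for the separate discussion.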
   
   Tracking PR: #50025
   

