pitrou opened a new issue, #48234: URL: https://github.com/apache/arrow/issues/48234
### Describe the bug, including details regarding any error messages, version, and platform.

The `data_size - 4` check in this snippet does not have a rational justification and looks overly strict:
https://github.com/apache/arrow/blob/55587efbf4f272afda97bff2f33d6aaf4b4c0c8a/cpp/src/parquet/column_reader.cc#L128-L145

It probably doesn't matter in most cases, as there are usually at least 4 bytes of encoded page values after the levels, but in some fringe cases (such as all/most values being null) there might not be.

The original change from `data_size` to `data_size - 4` was made by me in https://github.com/apache/arrow/pull/6848 and it was probably based on a misunderstanding, or on blindly copying the similar check for `Encoding::RLE` (which, unlike `Encoding::BIT_PACKED`, does have a 4-byte length header).

### Component(s)

C++, Parquet
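To illustrate the fringe case described above, here is a minimal, self-contained sketch of a bit-packed level size check. It is not the code from `column_reader.cc`: the function name `BitPackedLevelBytes` and the `slack` parameter are hypothetical, introduced only to compare the `data_size - 4` bound with a plain `data_size` bound.

```cpp
#include <cstdint>
#include <optional>

// Hypothetical helper (not part of Arrow): returns the number of bytes taken
// by `num_values` bit-packed levels of `bit_width` bits each, or std::nullopt
// if that size does not fit in a page of `data_size` bytes.
// `slack` is the number of bytes assumed to be reserved after the levels:
// 4 mimics the `data_size - 4` check, 0 mimics a plain `data_size` check.
std::optional<int64_t> BitPackedLevelBytes(int32_t num_values, int bit_width,
                                           int64_t data_size, int64_t slack) {
  if (num_values < 0 || bit_width < 0 || bit_width > 64) {
    return std::nullopt;
  }
  // At most 2^31 * 64 bits, so the product cannot overflow int64_t.
  const int64_t num_bits = static_cast<int64_t>(num_values) * bit_width;
  const int64_t num_bytes = (num_bits + 7) / 8;  // round up to whole bytes
  if (num_bytes > data_size - slack) {
    return std::nullopt;  // would be reported as a corrupt data page
  }
  return num_bytes;
}
```

Under these assumptions, a page holding 8 one-bit levels and no encoded values (for example, all values null) has `data_size == 1`. `BitPackedLevelBytes(8, 1, 1, 4)` rejects it, while `BitPackedLevelBytes(8, 1, 1, 0)` accepts it and reports 1 byte of levels.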
