pitrou opened a new issue, #48234:
URL: https://github.com/apache/arrow/issues/48234

   ### Describe the bug, including details regarding any error messages, version, and platform.
   
   The `data_size - 4` check in this snippet does not have a rational 
justification and looks overly strict:
   
https://github.com/apache/arrow/blob/55587efbf4f272afda97bff2f33d6aaf4b4c0c8a/cpp/src/parquet/column_reader.cc#L128-L145
   
   It probably doesn't matter in most cases, as there are usually at least 4 
bytes of encoded page values after the levels, but in some fringe cases (such 
as all or most values being null) there might not be.
   
   The original change from `data_size` to `data_size - 4` was made by me in 
https://github.com/apache/arrow/pull/6848 and it was probably based on a 
misunderstanding, or just blindly copying the similar check for `Encoding::RLE` 
(which, unlike `Encoding::BIT_PACKED`, does have a 4-byte length header).
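
   To illustrate the fringe case, here is a minimal standalone sketch. It is not the code from `column_reader.cc`: the helper names `BitPackedLevelBytes`, `StrictCheckAccepts` and `RelaxedCheckAccepts` are hypothetical, and it assumes bit-packed levels occupy `ceil(num_values * bit_width / 8)` bytes with no length header.

```cpp
#include <cstdint>

// Hypothetical helper (not an Arrow API): number of bytes needed to store
// `num_values` levels bit-packed at `bit_width` bits each, with no header.
int64_t BitPackedLevelBytes(int64_t num_values, int bit_width) {
  return (num_values * bit_width + 7) / 8;
}

// The check as described in this issue: it reserves 4 extra bytes even
// though BIT_PACKED levels carry no 4-byte length header.
bool StrictCheckAccepts(int64_t num_bytes, int64_t data_size) {
  return num_bytes >= 0 && num_bytes <= data_size - 4;
}

// The relaxed check suggested by this issue: the levels only have to fit
// inside the page data.
bool RelaxedCheckAccepts(int64_t num_bytes, int64_t data_size) {
  return num_bytes >= 0 && num_bytes <= data_size;
}
```

   Under these assumptions, a page with 8 levels at bit width 1 whose values are all null needs 1 byte of levels and no bytes of values, so `data_size` is 1. `StrictCheckAccepts(1, 1)` rejects that page while `RelaxedCheckAccepts(1, 1)` accepts it. Once at least 4 bytes of values follow the levels (for example `data_size` of 5), both checks agree.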
   
   ### Component(s)
   
   C++, Parquet


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
