pitrou commented on issue #47981: URL: https://github.com/apache/arrow/issues/47981#issuecomment-3460415464
OK, I looked at this. These are extremely old Parquet files that were generated while the [Parquet format](https://github.com/apache/parquet-format/) was still being designed. One of the columns, of length 25, has RLE_BIT_PACKED-encoded dictionary integers with a bit width of 5. The RLE_BIT_PACKED data is 17 bytes long and contains a single bit-packed run with count=4, that is, 4 groups of 8 values, or 32 values. But 32 5-bit values need 20 bytes, and 17 bytes (of which 1 byte is the header) leave only 16 bytes of payload, enough for just 25 values. So, while the RLE_BIT_PACKED data can technically encode the 25 integers, it lacks the padding implicit in the spec. Now, we could relax this a bit in our decoder, but the question is whether more recent files share this property.
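To make the arithmetic concrete, here is a minimal Python sketch of the size check. It is not the Arrow decoder: the function name `bit_packed_run_capacity` is hypothetical, and it assumes the run header fits in a single byte (true for count=4) and has the form `(count << 1) | 1`, with each counted group holding 8 values.

```python
def bit_packed_run_capacity(header: int, bit_width: int, payload_bytes: int):
    """Compare what a bit-packed run header declares with what the payload can hold.

    Hypothetical helper for illustration only. Assumes a single-byte header
    of the form (count << 1) | 1, where `count` is the number of 8-value groups.
    """
    if header & 1 != 1:
        raise ValueError("not a bit-packed run header")
    groups = header >> 1
    declared_values = groups * 8
    # 8 values of `bit_width` bits each occupy exactly `bit_width` bytes.
    required_bytes = groups * bit_width
    # Number of whole values that fit in the bytes actually present.
    available_values = (payload_bytes * 8) // bit_width
    return declared_values, required_bytes, available_values


# The case described above: count=4, bit width 5, 17 bytes minus 1 header byte.
header = (4 << 1) | 1
declared, required, available = bit_packed_run_capacity(
    header, bit_width=5, payload_bytes=17 - 1
)
print(declared, required, available)  # 32 20 25
```

The run declares 32 values and would need 20 payload bytes, but the 16 bytes present only hold 25 values, which happens to match the column length.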
