pitrou commented on issue #47981:
URL: https://github.com/apache/arrow/issues/47981#issuecomment-3460415464

   Ok, I looked at this.
   
   These are extremely old Parquet files that were generated while the [Parquet 
format](https://github.com/apache/parquet-format/) was still being designed.
   
   One of the columns, of length 25, has RLE_BIT_PACKED-encoded dictionary 
integers with a bit width of 5. The RLE_BIT_PACKED data is 17 bytes long and 
contains a single bit-packed run with count=4, i.e. 4 groups of 8, which is 
really 32 values. Encoding 32 5-bit values takes 20 bytes, but 17 bytes (of 
which 1 byte is the header) leave only 16 bytes of payload, enough for 25 
values. So, while the RLE_BIT_PACKED data can technically encode the 25 
integers, it lacks the padding implicit in the spec.
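
   For reference, the arithmetic can be checked with a short Python sketch. 
The helper name `bit_packed_run_sizes` and its parameters are made up for 
illustration and are not part of the Arrow or Parquet APIs. It only assumes 
that a bit-packed run header counts groups of 8 values.

```python
def bit_packed_run_sizes(group_count, bit_width, payload_bytes):
    """Return (declared_values, bytes_needed, values_that_fit).

    Hypothetical helper, for illustration only.
    """
    # A bit-packed run header counts groups of 8 values.
    declared_values = group_count * 8
    # Bytes required to store every declared value, rounded up.
    bytes_needed = (declared_values * bit_width + 7) // 8
    # Whole values that actually fit in the available payload.
    values_that_fit = (payload_bytes * 8) // bit_width
    return declared_values, bytes_needed, values_that_fit


# Numbers from the file above: count=4, bit width 5, 17 bytes minus 1 header byte.
declared, needed, fit = bit_packed_run_sizes(4, 5, 17 - 1)
print(declared, needed, fit)  # 32 20 25
```

   The payload is 4 bytes short of the 20 bytes the run header implies, yet 
it still holds exactly the 25 values the column needs.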
   
   Now, we could relax this a bit in our decoder, but the question is whether 
more recent files share this property.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
