An issue was recently raised [1] in arrow-rs about reading a file with improperly encoded UINT_8 and UINT_16 columns. For instance, the UINT_8 value 238 (0xee) was PLAIN-encoded as 0xffffffee. When the file was read by parquet-rs, null was returned. For the same file, parquet-java (well, parquet-cli cat) returned -18, and arrow-cpp returned 238.
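
To make the three outcomes concrete, here is a small Python sketch of how the same four PLAIN-encoded bytes can yield each result. The function names are hypothetical and this is not taken from any of the readers; it only shows three interpretations that would be consistent with the outputs reported above.

```python
import struct

# PLAIN encoding stores an INT32 as 4 little-endian bytes,
# so 0xffffffee is laid out on disk as ee ff ff ff.
raw = bytes.fromhex("eeffffff")
(physical,) = struct.unpack("<i", raw)  # the stored signed INT32


def read_signed(value):
    """Ignore the UINT_8 annotation and return the physical INT32 as-is."""
    return value


def read_truncating(value):
    """Keep only the low 8 bits, as a narrowing cast to uint8 would."""
    return value & 0xFF


def read_checked(value):
    """Treat anything outside the UINT_8 range [0, 255] as invalid (null)."""
    return value if 0 <= value <= 0xFF else None


print(read_signed(physical))      # -18
print(read_truncating(physical))  # 238
print(read_checked(physical))     # None
```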

The Parquet specification [2] states that behavior in this case is undefined, so all three readers are correct. I'm just wondering if there is any desire in the community to suggest handling such malformed data in a more consistent fashion, or just leave UB as UB.

Thanks,
Ed

[1] https://github.com/apache/arrow-rs/issues/7040
[2] https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#unsigned-integers