An issue was recently raised [1] in arrow-rs about reading a file whose
UINT_8 and UINT_16 columns were improperly encoded. For instance, a UINT_8
value of 238 (0xee) was PLAIN-encoded as the INT32 0xffffffee. When read by
parquet-rs, a value of null was returned. For the same file, parquet-java
(well, parquet-cli cat) returned -18, and arrow-cpp returned 238.
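
To illustrate, here is a minimal Rust sketch of how one stored value could
plausibly produce all three results. The function names are hypothetical and
this is not the actual code of any of the three readers; it only assumes that
one reader passes the signed INT32 through, one keeps the low 8 bits, and one
does a range-checked conversion that maps failure to null.

```rust
// Physical INT32 as stored in the file: the bit pattern 0xffffffee.
fn raw_value() -> i32 {
    0xffff_ffee_u32 as i32
}

// Assumption: pass the signed INT32 through unchanged
// (consistent with the -18 reported for parquet-cli cat).
fn as_signed(raw: i32) -> i32 {
    raw
}

// Assumption: keep only the low 8 bits
// (consistent with the 238 reported for arrow-cpp).
fn truncate_to_u8(raw: i32) -> u8 {
    raw as u8
}

// Assumption: range-checked conversion, with None standing in for null
// (consistent with the null reported for parquet-rs).
fn checked_to_u8(raw: i32) -> Option<u8> {
    u8::try_from(raw).ok()
}

fn main() {
    let raw = raw_value();
    println!("signed:    {}", as_signed(raw));
    println!("truncated: {}", truncate_to_u8(raw));
    println!("checked:   {:?}", checked_to_u8(raw));
}
```

Each of the three conversions is a defensible reading of the same bits, which
is why the results diverge.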

The Parquet specification [2] states that behavior in this case is undefined, 
so all three readers are correct. I'm just wondering if there is any desire in 
the community to suggest handling such malformed data in a more consistent 
fashion, or just leave UB as UB.

Thanks,
Ed

[1] https://github.com/apache/arrow-rs/issues/7040
[2] https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#unsigned-integers
