In my opinion, the more consistently such cases are handled, the better
off the ecosystem as a whole will be (we would waste less time chasing
down apparent incompatibilities).

Given its predominance in the ecosystem, I would personally suggest updating
the other readers to follow the parquet-java implementation if practical.

Thanks,
Andrew

On Fri, Jan 31, 2025 at 7:09 PM Ed Seidl <etse...@live.com> wrote:

> An issue was recently raised [1] in arrow-rs questioning the reading of a
> file that had improperly encoded UINT_8 and UINT_16 columns. For instance,
> a UINT_8 value of 238 (0xee) was plain encoded as 0xffffffee. When read by
> parquet-rs, a value of null was returned. For the same file, parquet-java
> (well, parquet-cli cat) returned -18, and arrow-cpp returned 238.
>
> The Parquet specification [2] states that behavior in this case is
> undefined, so all three readers are correct. I'm just wondering if there is
> any desire in the community to suggest handling such malformed data in a
> more consistent fashion, or just leave UB as UB.
>
> Thanks,
> Ed
>
> [1] https://github.com/apache/arrow-rs/issues/7040
> [2]
> https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#unsigned-integers
>
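For readers following along, the three divergent results described above can be reproduced outside of Parquet entirely. This is a hedged sketch (not the actual decoding code of any of the three libraries) of three plausible ways a reader might map the malformed 32-bit physical value 0xFFFFFFEE onto a UINT_8 logical column, matching the observed outputs:

```python
# The on-disk plain-encoded 32-bit physical value for the UINT_8 column.
raw = 0xFFFFFFEE

# parquet-java-like behavior: reinterpret all 32 bits as a signed
# two's-complement integer, ignoring the UINT_8 annotation.
as_signed_32 = raw - (1 << 32) if raw & 0x80000000 else raw  # -18

# arrow-cpp-like behavior: truncate to the low 8 bits implied by UINT_8.
as_masked_8 = raw & 0xFF  # 238

# parquet-rs-like behavior: treat an out-of-range value as invalid,
# yielding null.
as_checked = raw if raw <= 0xFF else None  # None

print(as_signed_32, as_masked_8, as_checked)
```

All three mappings are defensible precisely because the spec leaves the behavior undefined; the sketch is only meant to show how little the readers actually disagree about the bytes themselves.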
