Wouldn't it be better to return 238 in this case? What does it mean for
parquet-java to return -18 when the logical type is UINT8?

Thinking a bit further, is this something we perhaps want to specify?
Something like:
- if the stored value is larger than the maximum allowed by the annotation,
only the lower N bits are taken into account
- if the stored value is smaller than the maximum allowed by the annotation,
the value is sign-extended if signed, otherwise zero-extended
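
To make the first rule concrete, here is a rough Python sketch of what I
have in mind. It assumes the physical type is INT32 and that N is the bit
width from the annotation; `normalize` is a hypothetical helper for
illustration, not an API of any existing Parquet reader.

```python
def normalize(stored: int, bits: int, signed: bool) -> int:
    """Keep only the low `bits` bits of a stored integer, then
    re-interpret them as signed or unsigned per the annotation."""
    mask = (1 << bits) - 1
    low = stored & mask                      # drop everything above bit N-1
    if signed and low & (1 << (bits - 1)):   # top kept bit set: negative value
        return low - (1 << bits)             # sign-extend
    return low                               # zero-extend


# The value from the thread: 0xffffffee plain encoded as a little-endian INT32.
raw = int.from_bytes(bytes.fromhex("eeffffff"), "little", signed=True)

as_int32 = raw                                  # -18, ignoring the annotation
as_uint8 = normalize(raw, bits=8, signed=False) # 238, low 8 bits, zero-extended
as_int8 = normalize(raw, bits=8, signed=True)   # -18, low 8 bits, sign-extended
```

With this rule a UINT8 column would give 238 for the file in question, and
an INT8 column holding the same bytes would give -18.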


On Sat, Feb 1, 2025 at 5:00 PM Andrew Lamb <andrewlam...@gmail.com> wrote:

> In my opinion, the more consistently they are handled the better the
> ecosystem as a whole would be (we would waste less time chasing down
> seeming incompatibilities).
>
> Given its predominance in the ecosystem I would personally suggest updating
> the other readers to follow the parquet-java implementation if practical.
>
> Thanks,
> Andrew
>
> On Fri, Jan 31, 2025 at 7:09 PM Ed Seidl <etse...@live.com> wrote:
>
> > An issue was recently raised [1] in arrow-rs questioning the reading of a
> > file that had improperly encoded UINT_8 and UINT_16 columns. For instance,
> > a UINT_8 value of 238 (0xee) was plain encoded as 0xffffffee. When read by
> > parquet-rs, a value of null was returned. For the same file, parquet-java
> > (well, parquet-cli cat) returned -18, and arrow-cpp returned 238.
> >
> > The Parquet specification [2] states that behavior in this case is
> > undefined, so all three readers are correct. I'm just wondering if there is
> > any desire in the community to suggest handling such malformed data in a
> > more consistent fashion, or just leave UB as UB.
> >
> > Thanks,
> > Ed
> >
> > [1] https://github.com/apache/arrow-rs/issues/7040
> > [2] https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#unsigned-integers
> >
>
