The challenge with undefined behaviour is it inherently leads to 
incompatibilities, which in turn wastes people's time chasing ghosts. 

In this case I would suggest explicitly disallowing such data, and encourage 
implementations to return an error. This is simple, avoids adding complexity to 
the specification, and avoids the potential can of worms as to how to apply 
this masking logic for statistics, bloom filters, etc... 

I don't believe any real parquet writer will produce such data, and so I think 
we should just take the simplest option of not allowing it.

Kind Regards,

Raphael



On 3 February 2025 09:25:42 GMT, Gang Wu <ust...@gmail.com> wrote:
>I'm inclined to leave it as an undefined behavior from the perspective of
>spec to keep the spec simple.
>
>On Mon, Feb 3, 2025 at 4:57 PM Alkis Evlogimenos
><alkis.evlogime...@databricks.com.invalid> wrote:
>
>> Wouldn't it be better to return 238 in this case? What does it mean for
>> parquet-java to return -18 when the logical type is UINT8?
>>
>> Thinking a bit further is this something we perhaps want to specify?
>> Something like:
>> - if the stored value is larger than the maximum allowed by the annotation
>> only the lower N bits are taking into account
>> - if the stored value is smaller than the maximum allowed by the annotation
>> the value is sign-extended if signed otherwise zero-extended
>>
>>
>> On Sat, Feb 1, 2025 at 5:00 PM Andrew Lamb <andrewlam...@gmail.com> wrote:
>>
>> > In my opinion, the more consistently they are handled the better the
>> > ecosystem as a whole would be (we would waste less time chasing down
>> > seeming incompatibilities)
>> >
>> > Given its predominance in the ecosystem I would personally suggest
>> updating
>> > the other readers to follow the parquet-vava implementation if practical.
>> >
>> > Thanks,
>> > Andrew
>> >
>> > On Fri, Jan 31, 2025 at 7:09 PM Ed Seidl <etse...@live.com> wrote:
>> >
>> > > An issue was recently raised [1] in arrow-rs questioning the reading
>> of a
>> > > file that had improperly encoded UINT_8 and UINT_16 columns. For
>> > instance,
>> > > a UINT_8 value of 238 (0xee) was plain encoded as 0xffffffee. When read
>> > by
>> > > parquet-rs, a value of null was returned. For the same file,
>> parquet-java
>> > > (well, parquet-cli cat) returned -18, and arrow-cpp returned 238.
>> > >
>> > > The Parquet specification [2] states that behavior in this case is
>> > > undefined, so all three readers are correct. I'm just wondering if
>> there
>> > is
>> > > any desire in the community to suggest handling such malformed data in
>> a
>> > > more consistent fashion, or just leave UB as UB.
>> > >
>> > > Thanks,
>> > > Ed
>> > >
>> > > [1] https://github.com/apache/arrow-rs/issues/7040
>> > > [2]
>> > >
>> >
>> https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#unsigned-integers
>> > >
>> >
>>

Reply via email to