I agree that ambiguity is not good. The spec is clear that writer
implementations should not write these values. However, it is tricky
on the reader side. IMHO, it is similar to the case of schema evolution
when we down cast int32 to int8. It is engine-specific to throw an error
or return the lower 8 bits, and some engines use wider types to return
the exact values.

On Mon, Feb 3, 2025 at 6:28 PM Raphael Taylor-Davies
<r.taylordav...@googlemail.com.invalid> wrote:

> Right, but the purpose of the specification is to say what
> implementations should do, not necessarily what they do currently, and
> that way over time implementations will converge.
>
> There will always be non-spec compliant implementations, the point of a
> specification is to define what correct looks like. If implementations
> do something different, that is then a problem for those implementations
> to fix, and not every other implementation to accommodate. Ambiguity in
> the specification just forces every implementation to make value
> judgements about what non-standardized behaviour they should choose to
> have, which seems unfortunate.
>
> Kind Regards,
>
> Raphael
>
> On 03/02/2025 10:16, Gang Wu wrote:
> > The problem is that we cannot control all the implementations in the
> wild.
> >
> > Therefore we can only suggest that writers should not write such kind of
> > data and readers are free to return an error or anything.
> >
> > On Mon, Feb 3, 2025 at 6:05 PM Raphael Taylor-Davies
> > <r.taylordav...@googlemail.com.invalid> wrote:
> >
> >> The challenge with undefined behaviour is it inherently leads to
> >> incompatibilities, which in turn wastes people's time chasing ghosts.
> >>
> >> In this case I would suggest explicitly disallowing such data, and
> >> encourage implementations to return an error. This is simple, avoids
> adding
> >> complexity to the specification, and avoids the potential can of worms
> as
> >> to how to apply this masking logic for statistics, bloom filters, etc...
> >>
> >> I don't believe any real parquet writer will produce such data, and so I
> >> think we should just take the simplest option of not allowing it.
> >>
> >> Kind Regards,
> >>
> >> Raphael
> >>
> >>
> >>
> >> On 3 February 2025 09:25:42 GMT, Gang Wu <ust...@gmail.com> wrote:
> >>> I'm inclined to leave it as an undefined behavior from the perspective
> of
> >>> spec to keep the spec simple.
> >>>
> >>> On Mon, Feb 3, 2025 at 4:57 PM Alkis Evlogimenos
> >>> <alkis.evlogime...@databricks.com.invalid> wrote:
> >>>
> >>>> Wouldn't it be better to return 238 in this case? What does it mean
> for
> >>>> parquet-java to return -18 when the logical type is UINT8?
> >>>>
> >>>> Thinking a bit further is this something we perhaps want to specify?
> >>>> Something like:
> >>>> - if the stored value is larger than the maximum allowed by the
> >> annotation
> >>>> only the lower N bits are taking into account
> >>>> - if the stored value is smaller than the maximum allowed by the
> >> annotation
> >>>> the value is sign-extended if signed otherwise zero-extended
> >>>>
> >>>>
> >>>> On Sat, Feb 1, 2025 at 5:00 PM Andrew Lamb <andrewlam...@gmail.com>
> >> wrote:
> >>>>> In my opinion, the more consistently they are handled the better the
> >>>>> ecosystem as a whole would be (we would waste less time chasing down
> >>>>> seeming incompatibilities)
> >>>>>
> >>>>> Given its predominance in the ecosystem I would personally suggest
> >>>> updating
> >>>>> the other readers to follow the parquet-vava implementation if
> >> practical.
> >>>>> Thanks,
> >>>>> Andrew
> >>>>>
> >>>>> On Fri, Jan 31, 2025 at 7:09 PM Ed Seidl <etse...@live.com> wrote:
> >>>>>
> >>>>>> An issue was recently raised [1] in arrow-rs questioning the reading
> >>>> of a
> >>>>>> file that had improperly encoded UINT_8 and UINT_16 columns. For
> >>>>> instance,
> >>>>>> a UINT_8 value of 238 (0xee) was plain encoded as 0xffffffee. When
> >> read
> >>>>> by
> >>>>>> parquet-rs, a value of null was returned. For the same file,
> >>>> parquet-java
> >>>>>> (well, parquet-cli cat) returned -18, and arrow-cpp returned 238.
> >>>>>>
> >>>>>> The Parquet specification [2] states that behavior in this case is
> >>>>>> undefined, so all three readers are correct. I'm just wondering if
> >>>> there
> >>>>> is
> >>>>>> any desire in the community to suggest handling such malformed data
> >> in
> >>>> a
> >>>>>> more consistent fashion, or just leave UB as UB.
> >>>>>>
> >>>>>> Thanks,
> >>>>>> Ed
> >>>>>>
> >>>>>> [1] https://github.com/apache/arrow-rs/issues/7040
> >>>>>> [2]
> >>>>>>
> >>
> https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#unsigned-integers
>

Reply via email to