On Sun, Feb 9, 2025 at 12:39 AM Micah Kornfield <emkornfi...@gmail.com> wrote:
> Based on discussion on some of the issues it sounds like there are no known
> production systems that produce this sort of invalid file (it is only a
> test harness).

The test harness merely demonstrates the issue; it does not rule out the
possibility of the same issue existing elsewhere. ExampleParquetWriter itself
simply extends ParquetWriter, which is a public class and could be the basis
for any production system. Spark and pyarrow produce different results for
such values. Spark itself does not have unsigned int types, yet it
specifically implemented support for reading them, so I can only guess that
the requirement came from some non-synthetic use case.
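For illustration, here is a minimal sketch of that point (the schema, file
path, and value are made up, and the builder details may differ slightly
between parquet-java versions). Any code built on the public ParquetWriter
API can emit an out-of-range value for a UINT_8 column, since the value is
not validated against the logical-type range at write time:

import org.apache.hadoop.fs.Path;
import org.apache.parquet.example.data.Group;
import org.apache.parquet.example.data.simple.SimpleGroupFactory;
import org.apache.parquet.hadoop.ParquetWriter;
import org.apache.parquet.hadoop.example.ExampleParquetWriter;
import org.apache.parquet.schema.LogicalTypeAnnotation;
import org.apache.parquet.schema.MessageType;
import org.apache.parquet.schema.PrimitiveType.PrimitiveTypeName;
import org.apache.parquet.schema.Types;

public class WriteOutOfRangeUint8 {
  public static void main(String[] args) throws Exception {
    // A UINT_8 column backed by a 32-bit physical integer.
    MessageType schema = Types.buildMessage()
        .required(PrimitiveTypeName.INT32)
        .as(LogicalTypeAnnotation.intType(8, /* signed = */ false))
        .named("u8")
        .named("example");

    try (ParquetWriter<Group> writer = ExampleParquetWriter
        .builder(new Path("/tmp/uint8.parquet"))
        .withType(schema)
        .build()) {
      // 0xFFFFFFEE is -18 as a signed int32; only the low byte (0xEE = 238)
      // fits the UINT_8 annotation, but nothing here rejects the value.
      writer.write(new SimpleGroupFactory(schema).newGroup().append("u8", 0xFFFFFFEE));
    }
  }
}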
Parth

> On Tue, Feb 4, 2025 at 8:44 PM Parth Chandra <par...@apache.org> wrote:
> >
> > I logged the original issue in arrow-rs and a corresponding issue in Parquet [1]
> > while working with Datafusion Comet.
> > For background, Comet uses the ExampleParquetWriter to write a Parquet file used
> > by its unit tests, which compare the results returned by Comet to the results
> > returned by Spark.
> > @Steve Loughran <ste...@cloudera.com> asked a relevant question -
> > > Is the invalid data detected and reported as such by the different parsers - or
> > > does it get returned and the apps are then expected to notice the problem?
> > The problem is exactly that the different implementations return different results
> > and expect the applications to handle it.
> >
> > I don't have a binding say in this matter, but any decision that results in
> > consistent behavior would get my vote.
> >
> > - Parth
> >
> > [1] https://github.com/apache/parquet-java/issues/3142
> >
> > On Tue, Feb 4, 2025 at 2:44 AM Antoine Pitrou <anto...@python.org> wrote:
> > >
> > > If we want to recommend something, then we should certainly recommend returning
> > > an error.
> > >
> > > Regards
> > >
> > > Antoine.
> > >
> > > On Tue, 4 Feb 2025 01:19:14 +0000 Ed Seidl <etse...@live.com> wrote:
> > > > So far I've tested parquet-rs, pyarrow/arrow-cpp, and parquet-java. None of
> > > > them return an error. Parquet-rs will return either null or 238 depending on
> > > > which API is used, pyarrow returns 238, and spark/parquet-java returns -18.
> > > >
> > > > If we don't want to mandate that readers must ignore bits outside the
> > > > specified range, then I agree with Raphael that it's probably best to return
> > > > an error here. Then at least users will know there's a problem with how the
> > > > file has been produced.
> > > >
> > > > Ed
> > > >
> > > > On 2/3/25, 11:33 AM, "Steve Loughran" <ste...@cloudera.com.INVALID> wrote:
> > > > > Is the invalid data detected and reported as such by the different parsers
> > > > > - or does it get returned and the apps are then expected to notice the
> > > > > problem?
> > > > >
> > > > > as that reporting of problems is what I care about.
> > > > >
> > > > > On Mon, 3 Feb 2025 at 16:53, Micah Kornfield <emkornfi...@gmail.com> wrote:
> > > > > > Could we take a step back? Is this a real bug (and at what layer)? It
> > > > > > seems that at least parquet-java is not validating data being passed to
> > > > > > it when writing. Are there other known implementations that write this
> > > > > > malformed data?
> > > > > >
> > > > > > On the topic of whether to keep behavior undefined: having gone through
> > > > > > a very long discussion on possible bugs for variant shredding, I see
> > > > > > very similar themes discussed here. It would be good to come to a
> > > > > > consensus as a community on the general topic of undefined behavior and
> > > > > > how we intend to handle it (maybe we can fork this thread).
> > > > > >
> > > > > > My two cents: I think retroactively defining behavior is not a great
> > > > > > idea, but in some cases it might be warranted. My preferred approach is
> > > > > > to keep notes on known writer bugs in a centralized location and let
> > > > > > readers decide how to handle them (we do this, for instance, for some
> > > > > > cases where we know stats are incorrect). I also think we really need a
> > > > > > validation tool that checks for undefined behavior and highlights it,
> > > > > > to help avoid debates like this in the future.
> > > > > >
> > > > > > Cheers,
> > > > > > Micah
> > > > > >
> > > > > > On Mon, Feb 3, 2025 at 6:48 AM Gang Wu <ust...@gmail.com> wrote:
> > > > > > > I agree that ambiguity is not good. The spec is clear that writer
> > > > > > > implementations should not write these values. However, it is tricky
> > > > > > > on the reader side. IMHO, it is similar to the case of schema
> > > > > > > evolution when we downcast int32 to int8. It is engine-specific to
> > > > > > > throw an error or return the lower 8 bits, and some engines use
> > > > > > > wider types to return the exact values.
> > > > > > >
> > > > > > > On Mon, Feb 3, 2025 at 6:28 PM Raphael Taylor-Davies
> > > > > > > <r.taylordav...@googlemail.com.invalid> wrote:
> > > > > > > > Right, but the purpose of the specification is to say what
> > > > > > > > implementations should do, not necessarily what they do currently,
> > > > > > > > and that way over time implementations will converge.
> > > > > > > >
> > > > > > > > There will always be non-spec-compliant implementations; the point
> > > > > > > > of a specification is to define what correct looks like. If
> > > > > > > > implementations do something different, that is then a problem for
> > > > > > > > those implementations to fix, and not for every other
> > > > > > > > implementation to accommodate. Ambiguity in the specification just
> > > > > > > > forces every implementation to make value judgements about what
> > > > > > > > non-standardized behaviour they should choose to have, which seems
> > > > > > > > unfortunate.
> > > > > > > >
> > > > > > > > Kind Regards,
> > > > > > > >
> > > > > > > > Raphael
> > > > > > > >
> > > > > > > > On 03/02/2025 10:16, Gang Wu wrote:
> > > > > > > > > The problem is that we cannot control all the implementations in
> > > > > > > > > the wild.
> > > > > > > > >
> > > > > > > > > Therefore we can only suggest that writers should not write such
> > > > > > > > > kind of data and that readers are free to return an error or
> > > > > > > > > anything.
> > > > > > > > >
> > > > > > > > > On Mon, Feb 3, 2025 at 6:05 PM Raphael Taylor-Davies
> > > > > > > > > <r.taylordav...@googlemail.com.invalid> wrote:
> > > > > > > > > > The challenge with undefined behaviour is that it inherently
> > > > > > > > > > leads to incompatibilities, which in turn wastes people's time
> > > > > > > > > > chasing ghosts.
> > > > > > > > > >
> > > > > > > > > > In this case I would suggest explicitly disallowing such data,
> > > > > > > > > > and encourage implementations to return an error. This is
> > > > > > > > > > simple, avoids adding complexity to the specification, and
> > > > > > > > > > avoids the potential can of worms as to how to apply this
> > > > > > > > > > masking logic for statistics, bloom filters, etc.
> > > > > > > > > >
> > > > > > > > > > I don't believe any real parquet writer will produce such
> > > > > > > > > > data, and so I think we should just take the simplest option
> > > > > > > > > > of not allowing it.
> > > > > > > > > >
> > > > > > > > > > Kind Regards,
> > > > > > > > > >
> > > > > > > > > > Raphael
> > > > > > > > > >
> > > > > > > > > > On 3 February 2025 09:25:42 GMT, Gang Wu <ust...@gmail.com> wrote:
> > > > > > > > > > > I'm inclined to leave it as an undefined behavior from the
> > > > > > > > > > > perspective of the spec to keep the spec simple.
> > > > > > > > > > >
> > > > > > > > > > > On Mon, Feb 3, 2025 at 4:57 PM Alkis Evlogimenos
> > > > > > > > > > > <alkis.evlogime...@databricks.com.invalid> wrote:
> > > > > > > > > > > > Wouldn't it be better to return 238 in this case? What
> > > > > > > > > > > > does it mean for parquet-java to return -18 when the
> > > > > > > > > > > > logical type is UINT8?
> > > > > > > > > > > >
> > > > > > > > > > > > Thinking a bit further, is this something we perhaps want
> > > > > > > > > > > > to specify? Something like:
> > > > > > > > > > > > - if the stored value is larger than the maximum allowed
> > > > > > > > > > > >   by the annotation, only the lower N bits are taken into
> > > > > > > > > > > >   account
> > > > > > > > > > > > - if the stored value is smaller than the maximum allowed
> > > > > > > > > > > >   by the annotation, the value is sign-extended if signed,
> > > > > > > > > > > >   otherwise zero-extended
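For concreteness, a minimal sketch of the rule proposed above, applied to an
8-bit annotation stored in a physical INT32 (illustrative only; the method
names are made up and this is not part of the spec or of any reader):

// Keep only the bits permitted by the annotation when the stored value is
// out of range; extend to the wider in-memory type as usual otherwise.
static int readUint8(int stored) {
  return stored & 0xFF;   // 0xFFFFFFEE -> 0xEE = 238 (zero-extended)
}

static int readInt8(int stored) {
  return (byte) stored;   // low byte 0xEE -> -18 (sign-extended)
}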
> > > > > > > > > > > > On Sat, Feb 1, 2025 at 5:00 PM Andrew Lamb <andrewlam...@gmail.com> wrote:
> > > > > > > > > > > > > In my opinion, the more consistently they are handled
> > > > > > > > > > > > > the better the ecosystem as a whole would be (we would
> > > > > > > > > > > > > waste less time chasing down seeming incompatibilities).
> > > > > > > > > > > > >
> > > > > > > > > > > > > Given its predominance in the ecosystem, I would
> > > > > > > > > > > > > personally suggest updating the other readers to follow
> > > > > > > > > > > > > the parquet-java implementation if practical.
> > > > > > > > > > > > >
> > > > > > > > > > > > > Thanks,
> > > > > > > > > > > > > Andrew
> > > > > > > > > > > > >
> > > > > > > > > > > > > On Fri, Jan 31, 2025 at 7:09 PM Ed Seidl <etse...@live.com> wrote:
> > > > > > > > > > > > > > An issue was recently raised [1] in arrow-rs
> > > > > > > > > > > > > > questioning the reading of a file that had improperly
> > > > > > > > > > > > > > encoded UINT_8 and UINT_16 columns. For instance, a
> > > > > > > > > > > > > > UINT_8 value of 238 (0xee) was plain encoded as
> > > > > > > > > > > > > > 0xffffffee. When read by parquet-rs, a value of null
> > > > > > > > > > > > > > was returned. For the same file, parquet-java (well,
> > > > > > > > > > > > > > parquet-cli cat) returned -18, and arrow-cpp returned
> > > > > > > > > > > > > > 238.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > The Parquet specification [2] states that behavior in
> > > > > > > > > > > > > > this case is undefined, so all three readers are
> > > > > > > > > > > > > > correct. I'm just wondering if there is any desire in
> > > > > > > > > > > > > > the community to suggest handling such malformed data
> > > > > > > > > > > > > > in a more consistent fashion, or just leave UB as UB.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Thanks,
> > > > > > > > > > > > > > Ed
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > [1] https://github.com/apache/arrow-rs/issues/7040
> > > > > > > > > > > > > > [2] https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#unsigned-integers
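As a postscript, several replies above lean towards detecting the malformed
value and returning an error rather than guessing. A minimal sketch of that
option, purely for illustration (the method name is made up; this is not any
existing reader's API):

static int readUint8Strict(int stored) {
  // Any bits outside the 8-bit range declared by the annotation mean the
  // writer produced an invalid value; report it instead of masking it.
  if ((stored & ~0xFF) != 0) {
    throw new IllegalArgumentException(
        "value 0x" + Integer.toHexString(stored) + " is out of range for UINT_8");
  }
  return stored;
}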