Re: [DISCUSS] Handling of malformed UINT8 and UINT16 data

Antoine Pitrou Tue, 04 Feb 2025 02:44:41 -0800


If we want to recommend something, then we should certainly recommend
returning an error.


Regards

Antoine.


On Tue, 4 Feb 2025 01:19:14 +0000
Ed Seidl <etse...@live.com> wrote:
> So far I’ve tested parquet-rs, pyarrow/arrow-cpp, and parquet-java. None of 
> them return an error. Parquet-rs will return either null or 238 depending on 
> which API is used, pyarrow returns 238, and spark/parquet-java returns -18.
> 
> If we don’t want to mandate that readers must ignore bits outside the 
> specified range, then I agree with Raphael that it’s probably best to return 
> an error here. Then at least users will know there’s a problem with how the 
> file has been produced.
> 
> Ed
> 
> On 2/3/25, 11:33 AM, "Steve Loughran" <ste...@cloudera.com.INVALID> wrote:
> 
> > Is the invalid data detected and reported as such by the different parsers
> > -or does it get returned and the apps are then expected to notice the
> > problem?  
> 
> > as that reporting of problems is what i care about.  
> 
> 
> 
> 
> 
> > On Mon, 3 Feb 2025 at 16:53, Micah Kornfield <emkornfi...@gmail.com> wrote: 
> >  
> 
> > Could we take a step back? Is this a real bug (and at what layer), it seems
> > that at least paquet-java is not validating data being passed to it when
> > writing? Are there other known implementations that write this malformed
> > data?
> >
> > On the topic of whether to keep behavior undefined, having gone through a
> > very long discussion on possible bugs for variant shredding, I see very
> > similar themes discussed  here.  It would be good to come to a consensus as
> > a community on the general topic of undefined behavior and how we intend to
> > handle it (maybe we can fork this thread).
> >
> > My two-sense.  I think retroactively defining behavior is not a great idea
> > but in some cases might be warranted.  My preferred approach is keeping
> > notes someplace on known bugs of writers in a centralized location and
> > letting readers decide on how to handle it makes sense (we for instance do
> > this for some cases when we know stats are incorrect).  I also think we
> > really need a validation tool  that checks for undefined behavior and
> > highlights to help avoid debates like this in the future.
> >
> > Cheers,
> > Micah
> >
> >
> >
> > On Mon, Feb 3, 2025 at 6:48 AM Gang Wu <ust...@gmail.com> wrote:
> >  
> > > I agree that ambiguity is not good. The spec is clear that writer
> > > implementations should not write these values. However, it is tricky
> > > on the reader side. IMHO, it is similar to the case of schema evolution
> > > when we down cast int32 to int8. It is engine-specific to throw an error
> > > or return the lower 8 bits, and some engines use wider types to return
> > > the exact values.
> > >
> > > On Mon, Feb 3, 2025 at 6:28 PM Raphael Taylor-Davies
> > > <r.taylordav...@googlemail.com.invalid> wrote:
> > >  
> > > > Right, but the purpose of the specification is to say what
> > > > implementations should do, not necessarily what they do currently, and
> > > > that way over time implementations will converge.
> > > >
> > > > There will always be non-spec compliant implementations, the point of a
> > > > specification is to define what correct looks like. If implementations
> > > > do something different, that is then a problem for those  
> > implementations  
> > > > to fix, and not every other implementation to accommodate. Ambiguity in
> > > > the specification just forces every implementation to make value
> > > > judgements about what non-standardized behaviour they should choose to
> > > > have, which seems unfortunate.
> > > >
> > > > Kind Regards,
> > > >
> > > > Raphael
> > > >
> > > > On 03/02/2025 10:16, Gang Wu wrote:  
> > > > > The problem is that we cannot control all the implementations in the  
> > > > wild.  
> > > > >
> > > > > Therefore we can only suggest that writers should not write such kind 
> > > > >  
> > > of  
> > > > > data and readers are free to return an error or anything.
> > > > >
> > > > > On Mon, Feb 3, 2025 at 6:05 PM Raphael Taylor-Davies
> > > > > <r.taylordav...@googlemail.com.invalid> wrote:
> > > > >  
> > > > >> The challenge with undefined behaviour is it inherently leads to
> > > > >> incompatibilities, which in turn wastes people's time chasing  
> > ghosts.  
> > > > >>
> > > > >> In this case I would suggest explicitly disallowing such data, and
> > > > >> encourage implementations to return an error. This is simple, avoids 
> > > > >>  
> > > > adding  
> > > > >> complexity to the specification, and avoids the potential can of  
> > worms  
> > > > as  
> > > > >> to how to apply this masking logic for statistics, bloom filters,  
> > > etc...  
> > > > >>
> > > > >> I don't believe any real parquet writer will produce such data, and  
> > > so I  
> > > > >> think we should just take the simplest option of not allowing it.
> > > > >>
> > > > >> Kind Regards,
> > > > >>
> > > > >> Raphael
> > > > >>
> > > > >>
> > > > >>
> > > > >> On 3 February 2025 09:25:42 GMT, Gang Wu <ust...@gmail.com> wrote:  
> > > > >>> I'm inclined to leave it as an undefined behavior from the  
> > > perspective  
> > > > of  
> > > > >>> spec to keep the spec simple.
> > > > >>>
> > > > >>> On Mon, Feb 3, 2025 at 4:57 PM Alkis Evlogimenos
> > > > >>> <alkis.evlogime...@databricks.com.invalid> wrote:
> > > > >>>  
> > > > >>>> Wouldn't it be better to return 238 in this case? What does it  
> > mean  
> > > > for  
> > > > >>>> parquet-java to return -18 when the logical type is UINT8?
> > > > >>>>
> > > > >>>> Thinking a bit further is this something we perhaps want to  
> > specify?  
> > > > >>>> Something like:
> > > > >>>> - if the stored value is larger than the maximum allowed by the  
> > > > >> annotation  
> > > > >>>> only the lower N bits are taking into account
> > > > >>>> - if the stored value is smaller than the maximum allowed by the  
> > > > >> annotation  
> > > > >>>> the value is sign-extended if signed otherwise zero-extended
> > > > >>>>
> > > > >>>>
> > > > >>>> On Sat, Feb 1, 2025 at 5:00 PM Andrew Lamb <  
> > andrewlam...@gmail.com>  
> > > > >> wrote:  
> > > > >>>>> In my opinion, the more consistently they are handled the better  
> > > the  
> > > > >>>>> ecosystem as a whole would be (we would waste less time chasing  
> > > down  
> > > > >>>>> seeming incompatibilities)
> > > > >>>>>
> > > > >>>>> Given its predominance in the ecosystem I would personally  
> > suggest  
> > > > >>>> updating  
> > > > >>>>> the other readers to follow the parquet-vava implementation if  
> > > > >> practical.  
> > > > >>>>> Thanks,
> > > > >>>>> Andrew
> > > > >>>>>
> > > > >>>>> On Fri, Jan 31, 2025 at 7:09 PM Ed Seidl <etse...@live.com>  
> > wrote:  
> > > > >>>>>  
> > > > >>>>>> An issue was recently raised [1] in arrow-rs questioning the  
> > > reading  
> > > > >>>> of a  
> > > > >>>>>> file that had improperly encoded UINT_8 and UINT_16 columns. For 
> > > > >>>>>>  
> > > > >>>>> instance,  
> > > > >>>>>> a UINT_8 value of 238 (0xee) was plain encoded as 0xffffffee.  
> > When  
> > > > >> read  
> > > > >>>>> by  
> > > > >>>>>> parquet-rs, a value of null was returned. For the same file,  
> > > > >>>> parquet-java  
> > > > >>>>>> (well, parquet-cli cat) returned -18, and arrow-cpp returned  
> > 238.  
> > > > >>>>>>
> > > > >>>>>> The Parquet specification [2] states that behavior in this case  
> > is  
> > > > >>>>>> undefined, so all three readers are correct. I'm just wondering  
> > if  
> > > > >>>> there  
> > > > >>>>> is  
> > > > >>>>>> any desire in the community to suggest handling such malformed  
> > > data  
> > > > >> in  
> > > > >>>> a  
> > > > >>>>>> more consistent fashion, or just leave UB as UB.
> > > > >>>>>>
> > > > >>>>>> Thanks,
> > > > >>>>>> Ed
> > > > >>>>>>
> > > > >>>>>> [1] https://github.com/apache/arrow-rs/issues/7040
> > > > >>>>>> [2]
> > > > >>>>>>  
> > > > >>  
> > > >  
> > >  
> > https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#unsigned-integers
> >   
> > > >  
> > >  
> >

Re: [DISCUSS] Handling of malformed UINT8 and UINT16 data

Reply via email to