Could we take a step back? Is this a real bug (and at what layer), it seems that at least paquet-java is not validating data being passed to it when writing? Are there other known implementations that write this malformed data?
On the topic of whether to keep behavior undefined, having gone through a very long discussion on possible bugs for variant shredding, I see very similar themes discussed here. It would be good to come to a consensus as a community on the general topic of undefined behavior and how we intend to handle it (maybe we can fork this thread). My two-sense. I think retroactively defining behavior is not a great idea but in some cases might be warranted. My preferred approach is keeping notes someplace on known bugs of writers in a centralized location and letting readers decide on how to handle it makes sense (we for instance do this for some cases when we know stats are incorrect). I also think we really need a validation tool that checks for undefined behavior and highlights to help avoid debates like this in the future. Cheers, Micah On Mon, Feb 3, 2025 at 6:48 AM Gang Wu <ust...@gmail.com> wrote: > I agree that ambiguity is not good. The spec is clear that writer > implementations should not write these values. However, it is tricky > on the reader side. IMHO, it is similar to the case of schema evolution > when we down cast int32 to int8. It is engine-specific to throw an error > or return the lower 8 bits, and some engines use wider types to return > the exact values. > > On Mon, Feb 3, 2025 at 6:28 PM Raphael Taylor-Davies > <r.taylordav...@googlemail.com.invalid> wrote: > > > Right, but the purpose of the specification is to say what > > implementations should do, not necessarily what they do currently, and > > that way over time implementations will converge. > > > > There will always be non-spec compliant implementations, the point of a > > specification is to define what correct looks like. If implementations > > do something different, that is then a problem for those implementations > > to fix, and not every other implementation to accommodate. Ambiguity in > > the specification just forces every implementation to make value > > judgements about what non-standardized behaviour they should choose to > > have, which seems unfortunate. > > > > Kind Regards, > > > > Raphael > > > > On 03/02/2025 10:16, Gang Wu wrote: > > > The problem is that we cannot control all the implementations in the > > wild. > > > > > > Therefore we can only suggest that writers should not write such kind > of > > > data and readers are free to return an error or anything. > > > > > > On Mon, Feb 3, 2025 at 6:05 PM Raphael Taylor-Davies > > > <r.taylordav...@googlemail.com.invalid> wrote: > > > > > >> The challenge with undefined behaviour is it inherently leads to > > >> incompatibilities, which in turn wastes people's time chasing ghosts. > > >> > > >> In this case I would suggest explicitly disallowing such data, and > > >> encourage implementations to return an error. This is simple, avoids > > adding > > >> complexity to the specification, and avoids the potential can of worms > > as > > >> to how to apply this masking logic for statistics, bloom filters, > etc... > > >> > > >> I don't believe any real parquet writer will produce such data, and > so I > > >> think we should just take the simplest option of not allowing it. > > >> > > >> Kind Regards, > > >> > > >> Raphael > > >> > > >> > > >> > > >> On 3 February 2025 09:25:42 GMT, Gang Wu <ust...@gmail.com> wrote: > > >>> I'm inclined to leave it as an undefined behavior from the > perspective > > of > > >>> spec to keep the spec simple. > > >>> > > >>> On Mon, Feb 3, 2025 at 4:57 PM Alkis Evlogimenos > > >>> <alkis.evlogime...@databricks.com.invalid> wrote: > > >>> > > >>>> Wouldn't it be better to return 238 in this case? What does it mean > > for > > >>>> parquet-java to return -18 when the logical type is UINT8? > > >>>> > > >>>> Thinking a bit further is this something we perhaps want to specify? > > >>>> Something like: > > >>>> - if the stored value is larger than the maximum allowed by the > > >> annotation > > >>>> only the lower N bits are taking into account > > >>>> - if the stored value is smaller than the maximum allowed by the > > >> annotation > > >>>> the value is sign-extended if signed otherwise zero-extended > > >>>> > > >>>> > > >>>> On Sat, Feb 1, 2025 at 5:00 PM Andrew Lamb <andrewlam...@gmail.com> > > >> wrote: > > >>>>> In my opinion, the more consistently they are handled the better > the > > >>>>> ecosystem as a whole would be (we would waste less time chasing > down > > >>>>> seeming incompatibilities) > > >>>>> > > >>>>> Given its predominance in the ecosystem I would personally suggest > > >>>> updating > > >>>>> the other readers to follow the parquet-vava implementation if > > >> practical. > > >>>>> Thanks, > > >>>>> Andrew > > >>>>> > > >>>>> On Fri, Jan 31, 2025 at 7:09 PM Ed Seidl <etse...@live.com> wrote: > > >>>>> > > >>>>>> An issue was recently raised [1] in arrow-rs questioning the > reading > > >>>> of a > > >>>>>> file that had improperly encoded UINT_8 and UINT_16 columns. For > > >>>>> instance, > > >>>>>> a UINT_8 value of 238 (0xee) was plain encoded as 0xffffffee. When > > >> read > > >>>>> by > > >>>>>> parquet-rs, a value of null was returned. For the same file, > > >>>> parquet-java > > >>>>>> (well, parquet-cli cat) returned -18, and arrow-cpp returned 238. > > >>>>>> > > >>>>>> The Parquet specification [2] states that behavior in this case is > > >>>>>> undefined, so all three readers are correct. I'm just wondering if > > >>>> there > > >>>>> is > > >>>>>> any desire in the community to suggest handling such malformed > data > > >> in > > >>>> a > > >>>>>> more consistent fashion, or just leave UB as UB. > > >>>>>> > > >>>>>> Thanks, > > >>>>>> Ed > > >>>>>> > > >>>>>> [1] https://github.com/apache/arrow-rs/issues/7040 > > >>>>>> [2] > > >>>>>> > > >> > > > https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#unsigned-integers > > >