Adding more checks to verify that written data is valid would inevitably cost performance. I am inclined to keep the default behavior undefined, while providing tooling to assist with checking and verification as an opt-in.
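To make the trade-off concrete, here is a minimal sketch of the kind of opt-in check a writer could run before encoding a UINT_8 column. The helper name and shape are hypothetical; this is not part of any real Parquet writer API, just an illustration of validation that stays off the default write path:

```python
def validate_uint8_column(values):
    """Raise if any value falls outside the UINT_8 range [0, 255].

    Hypothetical opt-in helper, not part of any Parquet library API.
    A writer could call this only when the user asks for validation,
    keeping the default (unchecked) path free of the extra cost.
    """
    for i, v in enumerate(values):
        if not 0 <= v <= 0xFF:
            raise ValueError(
                f"value {v} at index {i} is outside the UINT_8 range 0..255"
            )
    return values
```

With such a tool, files like the one in the arrow-rs issue would be caught at write time instead of producing divergent reader behavior later.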
> On Feb 11, 2025, at 01:04, Parth Chandra <par...@apache.org> wrote:
>
> On Sun, Feb 9, 2025 at 12:39 AM Micah Kornfield <emkornfi...@gmail.com> wrote:
>
>> Based on discussion on some of the issues it sounds like there are no known
>> production systems that produce this sort of invalid file (it is only a
>> test harness).
>
> The test harness merely demonstrates the issue. It does not eliminate the
> possibility of the same issue existing elsewhere.
> ExampleParquetWriter itself simply extends ParquetWriter, which is a public
> class and could be the basis for any production system.
> Spark and pyarrow produce different results for such values. Spark itself
> does not have unsigned ints, but it specifically implemented support for
> unsigned ints. I'm only guessing that the requirement must have come from
> some non-synthetic use case.
>
> Parth
>
>> On Tue, Feb 4, 2025 at 8:44 PM Parth Chandra <par...@apache.org> wrote:
>>
>>> I logged the original issue in arrow-rs and a corresponding issue in
>>> Parquet [1] while working with Datafusion Comet.
>>> For background, Comet uses the ExampleParquetWriter to write a Parquet file
>>> used by its unit tests, which compare the results returned by Comet to the
>>> results returned by Spark.
>>> @Steve Loughran <ste...@cloudera.com> asked a relevant question -
>>>> Is the invalid data detected and reported as such by the different parsers
>>>> -or does it get returned and the apps are then expected to notice the
>>>> problem?
>>> The problem is exactly that the different implementations return different
>>> results and expect the applications to handle it.
>>>
>>> I don't have a binding say in this matter, but any decision that results in
>>> consistent behavior would get my vote.
>>>
>>> - Parth
>>>
>>> [1] https://github.com/apache/parquet-java/issues/3142
>>>
>>> On Tue, Feb 4, 2025 at 2:44 AM Antoine Pitrou <anto...@python.org> wrote:
>>>
>>>> If we want to recommend something, then we should certainly recommend
>>>> returning an error.
>>>>
>>>> Regards
>>>>
>>>> Antoine.
>>>>
>>>> On Tue, 4 Feb 2025 01:19:14 +0000
>>>> Ed Seidl <etse...@live.com> wrote:
>>>>> So far I've tested parquet-rs, pyarrow/arrow-cpp, and parquet-java.
>>>>> None of them return an error. Parquet-rs will return either null or 238
>>>>> depending on which API is used, pyarrow returns 238, and
>>>>> spark/parquet-java returns -18.
>>>>>
>>>>> If we don't want to mandate that readers must ignore bits outside the
>>>>> specified range, then I agree with Raphael that it's probably best to
>>>>> return an error here. Then at least users will know there's a problem
>>>>> with how the file has been produced.
>>>>>
>>>>> Ed
>>>>>
>>>>> On 2/3/25, 11:33 AM, "Steve Loughran" <ste...@cloudera.com.INVALID> wrote:
>>>>>
>>>>>> Is the invalid data detected and reported as such by the different
>>>>>> parsers -or does it get returned and the apps are then expected to
>>>>>> notice the problem?
>>>>>>
>>>>>> as that reporting of problems is what I care about.
>>>>>
>>>>>> On Mon, 3 Feb 2025 at 16:53, Micah Kornfield <emkornfi...@gmail.com> wrote:
>>>>>
>>>>>> Could we take a step back? Is this a real bug (and at what layer)? It
>>>>>> seems that at least parquet-java is not validating data being passed to
>>>>>> it when writing. Are there other known implementations that write this
>>>>>> malformed data?
>>>>>>
>>>>>> On the topic of whether to keep behavior undefined, having gone through
>>>>>> a very long discussion on possible bugs for variant shredding, I see
>>>>>> very similar themes discussed here. It would be good to come to a
>>>>>> consensus as a community on the general topic of undefined behavior and
>>>>>> how we intend to handle it (maybe we can fork this thread).
>>>>>>
>>>>>> My two cents: I think retroactively defining behavior is not a great
>>>>>> idea but in some cases might be warranted. My preferred approach is to
>>>>>> keep notes on known writer bugs in a centralized location and let
>>>>>> readers decide how to handle them (we do this, for instance, for some
>>>>>> cases where we know stats are incorrect). I also think we really need a
>>>>>> validation tool that checks for undefined behavior and highlights it,
>>>>>> to help avoid debates like this in the future.
>>>>>>
>>>>>> Cheers,
>>>>>> Micah
>>>>>>
>>>>>> On Mon, Feb 3, 2025 at 6:48 AM Gang Wu <ust...@gmail.com> wrote:
>>>>>>
>>>>>>> I agree that ambiguity is not good. The spec is clear that writer
>>>>>>> implementations should not write these values. However, it is tricky
>>>>>>> on the reader side. IMHO, it is similar to the case of schema evolution
>>>>>>> when we down cast int32 to int8. It is engine-specific to throw an
>>>>>>> error or return the lower 8 bits, and some engines use wider types to
>>>>>>> return the exact values.
>>>>>>>
>>>>>>> On Mon, Feb 3, 2025 at 6:28 PM Raphael Taylor-Davies
>>>>>>> <r.taylordav...@googlemail.com.invalid> wrote:
>>>>>>>
>>>>>>>> Right, but the purpose of the specification is to say what
>>>>>>>> implementations should do, not necessarily what they do currently,
>>>>>>>> and that way over time implementations will converge.
>>>>>>>>
>>>>>>>> There will always be non-spec-compliant implementations; the point of
>>>>>>>> a specification is to define what correct looks like. If
>>>>>>>> implementations do something different, that is then a problem for
>>>>>>>> those implementations to fix, and not every other implementation to
>>>>>>>> accommodate. Ambiguity in the specification just forces every
>>>>>>>> implementation to make value judgements about what non-standardized
>>>>>>>> behaviour they should choose to have, which seems unfortunate.
>>>>>>>>
>>>>>>>> Kind Regards,
>>>>>>>>
>>>>>>>> Raphael
>>>>>>>>
>>>>>>>> On 03/02/2025 10:16, Gang Wu wrote:
>>>>>>>>> The problem is that we cannot control all the implementations in
>>>>>>>>> the wild.
>>>>>>>>>
>>>>>>>>> Therefore we can only suggest that writers should not write such
>>>>>>>>> kind of data and readers are free to return an error or anything.
>>>>>>>>>
>>>>>>>>> On Mon, Feb 3, 2025 at 6:05 PM Raphael Taylor-Davies
>>>>>>>>> <r.taylordav...@googlemail.com.invalid> wrote:
>>>>>>>>>
>>>>>>>>>> The challenge with undefined behaviour is it inherently leads to
>>>>>>>>>> incompatibilities, which in turn wastes people's time chasing
>>>>>>>>>> ghosts.
>>>>>>>>>>
>>>>>>>>>> In this case I would suggest explicitly disallowing such data, and
>>>>>>>>>> encourage implementations to return an error. This is simple,
>>>>>>>>>> avoids adding complexity to the specification, and avoids the
>>>>>>>>>> potential can of worms as to how to apply this masking logic for
>>>>>>>>>> statistics, bloom filters, etc...
>>>>>>>>>>
>>>>>>>>>> I don't believe any real parquet writer will produce such data,
>>>>>>>>>> and so I think we should just take the simplest option of not
>>>>>>>>>> allowing it.
>>>>>>>>>>
>>>>>>>>>> Kind Regards,
>>>>>>>>>>
>>>>>>>>>> Raphael
>>>>>>>>>>
>>>>>>>>>> On 3 February 2025 09:25:42 GMT, Gang Wu <ust...@gmail.com> wrote:
>>>>>>>>>>> I'm inclined to leave it as an undefined behavior from the
>>>>>>>>>>> perspective of the spec, to keep the spec simple.
>>>>>>>>>>>
>>>>>>>>>>> On Mon, Feb 3, 2025 at 4:57 PM Alkis Evlogimenos
>>>>>>>>>>> <alkis.evlogime...@databricks.com.invalid> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Wouldn't it be better to return 238 in this case? What does it
>>>>>>>>>>>> mean for parquet-java to return -18 when the logical type is
>>>>>>>>>>>> UINT8?
>>>>>>>>>>>>
>>>>>>>>>>>> Thinking a bit further, is this something we perhaps want to
>>>>>>>>>>>> specify? Something like:
>>>>>>>>>>>> - if the stored value is larger than the maximum allowed by the
>>>>>>>>>>>> annotation, only the lower N bits are taken into account
>>>>>>>>>>>> - if the stored value is smaller than the minimum allowed by the
>>>>>>>>>>>> annotation, the value is sign-extended if signed, otherwise
>>>>>>>>>>>> zero-extended
>>>>>>>>>>>>
>>>>>>>>>>>> On Sat, Feb 1, 2025 at 5:00 PM Andrew Lamb <andrewlam...@gmail.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> In my opinion, the more consistently they are handled, the
>>>>>>>>>>>>> better the ecosystem as a whole would be (we would waste less
>>>>>>>>>>>>> time chasing down seeming incompatibilities).
>>>>>>>>>>>>>
>>>>>>>>>>>>> Given its predominance in the ecosystem, I would personally
>>>>>>>>>>>>> suggest updating the other readers to follow the parquet-java
>>>>>>>>>>>>> implementation if practical.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>> Andrew
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Fri, Jan 31, 2025 at 7:09 PM Ed Seidl <etse...@live.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> An issue was recently raised [1] in arrow-rs questioning the
>>>>>>>>>>>>>> reading of a file that had improperly encoded UINT_8 and
>>>>>>>>>>>>>> UINT_16 columns. For instance, a UINT_8 value of 238 (0xee)
>>>>>>>>>>>>>> was plain encoded as 0xffffffee. When read by parquet-rs, a
>>>>>>>>>>>>>> value of null was returned. For the same file, parquet-java
>>>>>>>>>>>>>> (well, parquet-cli cat) returned -18, and arrow-cpp returned
>>>>>>>>>>>>>> 238.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> The Parquet specification [2] states that behavior in this
>>>>>>>>>>>>>> case is undefined, so all three readers are correct. I'm just
>>>>>>>>>>>>>> wondering if there is any desire in the community to suggest
>>>>>>>>>>>>>> handling such malformed data in a more consistent fashion, or
>>>>>>>>>>>>>> just leave UB as UB.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>> Ed
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> [1] https://github.com/apache/arrow-rs/issues/7040
>>>>>>>>>>>>>> [2] https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#unsigned-integers
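For reference, the divergent results described in the thread follow directly from how each reader interprets the stored 32-bit pattern. A quick sketch in plain Python (no Parquet library involved) of the two interpretations behind -18 and 238:

```python
import struct

# The writer plain-encoded the UINT_8 value 238 using the sign-extended
# 32-bit pattern 0xffffffee instead of 0x000000ee.
raw_bits = 0xFFFFFFEE

# Reinterpreting those bits as a signed 32-bit integer yields -18,
# matching what parquet-java / parquet-cli reports.
as_signed_int32 = struct.unpack("<i", struct.pack("<I", raw_bits))[0]

# Masking down to the 8 bits declared by the UINT_8 annotation yields
# 238, matching what arrow-cpp / pyarrow reports.
as_uint8 = raw_bits & 0xFF

print(as_signed_int32, as_uint8)  # -18 238
```

Parquet-rs's null is a third choice: treating the out-of-range value as unrepresentable rather than reinterpreting the bits. All three are defensible precisely because the spec leaves the behavior undefined.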