Adding more checks to verify that written data is valid would inevitably cost performance. I am inclined to keep the default behavior undefined, while providing tooling to assist with checking and verification as an opt-in.
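To make the trade-off concrete, here is a minimal sketch of the kind of opt-in check a writer could run before encoding a UINT_8 column. The helper name and shape are hypothetical; this is not part of any real Parquet writer API, just an illustration of validation that stays off the default write path:

```python
def validate_uint8_column(values):
    """Raise if any value falls outside the UINT_8 range [0, 255].

    Hypothetical opt-in helper, not part of any Parquet library API.
    A writer could call this only when the user asks for validation,
    keeping the default (unchecked) path free of the extra cost.
    """
    for i, v in enumerate(values):
        if not 0 <= v <= 0xFF:
            raise ValueError(
                f"value {v} at index {i} is outside the UINT_8 range 0..255"
            )
    return values
```

With such a tool, files like the one in the arrow-rs issue would be caught at write time instead of producing divergent reader behavior later.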
> On Feb 11, 2025, at 01:04, Parth Chandra <par...@apache.org> wrote:
>
> On Sun, Feb 9, 2025 at 12:39 AM Micah Kornfield <emkornfi...@gmail.com> wrote:
>
>> Based on discussion on some of the issues it sounds like there are no known
>> production systems that produce this sort of invalid file (it is only a
>> test harness).
>
> The test harness merely demonstrates the issue. It does not eliminate the
> possibility of the same issue existing elsewhere.
> ExampleParquetWriter itself simply extends ParquetWriter, which is a public
> class and could be the basis for any production system.
> Spark and pyarrow produce different results for such values. Spark itself
> does not have unsigned ints, but it specifically implemented support for
> unsigned ints. I'm only guessing that the requirement must have come from
> some non-synthetic use case.
>
> Parth
>
>> On Tue, Feb 4, 2025 at 8:44 PM Parth Chandra <par...@apache.org> wrote:
>>
>>> I logged the original issue in arrow-rs and a corresponding issue in
>>> Parquet [1] while working with Datafusion Comet.
>>> For background, Comet uses the ExampleParquetWriter to write a Parquet file
>>> used by its unit tests, which compare the results returned by Comet to the
>>> results returned by Spark.
>>> @Steve Loughran <ste...@cloudera.com> asked a relevant question -
>>>> Is the invalid data detected and reported as such by the different parsers
>>>> -or does it get returned and the apps are then expected to notice the
>>>> problem?
>>> The problem is exactly that the different implementations return different
>>> results and expect the applications to handle it.
>>>
>>> I don't have a binding say in this matter, but any decision that results in
>>> consistent behavior would get my vote.
>>>
>>> - Parth
>>>
>>> [1] https://github.com/apache/parquet-java/issues/3142
>>>
>>> On Tue, Feb 4, 2025 at 2:44 AM Antoine Pitrou <anto...@python.org> wrote:
>>>
>>>> If we want to recommend something, then we should certainly recommend
>>>> returning an error.
>>>>
>>>> Regards
>>>>
>>>> Antoine.
>>>>
>>>> On Tue, 4 Feb 2025 01:19:14 +0000
>>>> Ed Seidl <etse...@live.com> wrote:
>>>>> So far I've tested parquet-rs, pyarrow/arrow-cpp, and parquet-java.
>>>>> None of them return an error. Parquet-rs will return either null or 238
>>>>> depending on which API is used, pyarrow returns 238, and
>>>>> spark/parquet-java returns -18.
>>>>>
>>>>> If we don't want to mandate that readers must ignore bits outside the
>>>>> specified range, then I agree with Raphael that it's probably best to
>>>>> return an error here. Then at least users will know there's a problem
>>>>> with how the file has been produced.
>>>>>
>>>>> Ed
>>>>>
>>>>> On 2/3/25, 11:33 AM, "Steve Loughran" <ste...@cloudera.com.INVALID> wrote:
>>>>>
>>>>>> Is the invalid data detected and reported as such by the different
>>>>>> parsers -or does it get returned and the apps are then expected to
>>>>>> notice the problem?
>>>>>>
>>>>>> as that reporting of problems is what I care about.
>>>>>
>>>>>> On Mon, 3 Feb 2025 at 16:53, Micah Kornfield <emkornfi...@gmail.com> wrote:
>>>>>
>>>>>> Could we take a step back? Is this a real bug (and at what layer)? It
>>>>>> seems that at least parquet-java is not validating data being passed to
>>>>>> it when writing. Are there other known implementations that write this
>>>>>> malformed data?
>>>>>>
>>>>>> On the topic of whether to keep behavior undefined, having gone through
>>>>>> a very long discussion on possible bugs for variant shredding, I see
>>>>>> very similar themes discussed here. It would be good to come to a
>>>>>> consensus as a community on the general topic of undefined behavior and
>>>>>> how we intend to handle it (maybe we can fork this thread).
>>>>>>
>>>>>> My two cents: I think retroactively defining behavior is not a great
>>>>>> idea but in some cases might be warranted. My preferred approach is to
>>>>>> keep notes on known writer bugs in a centralized location and let
>>>>>> readers decide how to handle them (we do this, for instance, for some
>>>>>> cases where we know stats are incorrect). I also think we really need a
>>>>>> validation tool that checks for undefined behavior and highlights it,
>>>>>> to help avoid debates like this in the future.
>>>>>>
>>>>>> Cheers,
>>>>>> Micah
>>>>>>
>>>>>> On Mon, Feb 3, 2025 at 6:48 AM Gang Wu <ust...@gmail.com> wrote:
>>>>>>
>>>>>>> I agree that ambiguity is not good. The spec is clear that writer
>>>>>>> implementations should not write these values. However, it is tricky
>>>>>>> on the reader side. IMHO, it is similar to the case of schema evolution
>>>>>>> when we down cast int32 to int8. It is engine-specific to throw an
>>>>>>> error or return the lower 8 bits, and some engines use wider types to
>>>>>>> return the exact values.
>>>>>>>
>>>>>>> On Mon, Feb 3, 2025 at 6:28 PM Raphael Taylor-Davies
>>>>>>> <r.taylordav...@googlemail.com.invalid> wrote:
>>>>>>>
>>>>>>>> Right, but the purpose of the specification is to say what
>>>>>>>> implementations should do, not necessarily what they do currently,
>>>>>>>> and that way over time implementations will converge.
>>>>>>>>
>>>>>>>> There will always be non-spec-compliant implementations; the point of
>>>>>>>> a specification is to define what correct looks like. If
>>>>>>>> implementations do something different, that is then a problem for
>>>>>>>> those implementations to fix, and not every other implementation to
>>>>>>>> accommodate. Ambiguity in the specification just forces every
>>>>>>>> implementation to make value judgements about what non-standardized
>>>>>>>> behaviour they should choose to have, which seems unfortunate.
>>>>>>>>
>>>>>>>> Kind Regards,
>>>>>>>>
>>>>>>>> Raphael
>>>>>>>>
>>>>>>>> On 03/02/2025 10:16, Gang Wu wrote:
>>>>>>>>> The problem is that we cannot control all the implementations in
>>>>>>>>> the wild.
>>>>>>>>>
>>>>>>>>> Therefore we can only suggest that writers should not write such
>>>>>>>>> kind of data and readers are free to return an error or anything.
>>>>>>>>>
>>>>>>>>> On Mon, Feb 3, 2025 at 6:05 PM Raphael Taylor-Davies
>>>>>>>>> <r.taylordav...@googlemail.com.invalid> wrote:
>>>>>>>>>
>>>>>>>>>> The challenge with undefined behaviour is it inherently leads to
>>>>>>>>>> incompatibilities, which in turn wastes people's time chasing
>>>>>>>>>> ghosts.
>>>>>>>>>>
>>>>>>>>>> In this case I would suggest explicitly disallowing such data, and
>>>>>>>>>> encourage implementations to return an error. This is simple,
>>>>>>>>>> avoids adding complexity to the specification, and avoids the
>>>>>>>>>> potential can of worms as to how to apply this masking logic for
>>>>>>>>>> statistics, bloom filters, etc...
>>>>>>>>>>
>>>>>>>>>> I don't believe any real parquet writer will produce such data,
>>>>>>>>>> and so I think we should just take the simplest option of not
>>>>>>>>>> allowing it.
>>>>>>>>>>
>>>>>>>>>> Kind Regards,
>>>>>>>>>>
>>>>>>>>>> Raphael
>>>>>>>>>>
>>>>>>>>>> On 3 February 2025 09:25:42 GMT, Gang Wu <ust...@gmail.com> wrote:
>>>>>>>>>>> I'm inclined to leave it as an undefined behavior from the
>>>>>>>>>>> perspective of the spec, to keep the spec simple.
>>>>>>>>>>>
>>>>>>>>>>> On Mon, Feb 3, 2025 at 4:57 PM Alkis Evlogimenos
>>>>>>>>>>> <alkis.evlogime...@databricks.com.invalid> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Wouldn't it be better to return 238 in this case? What does it
>>>>>>>>>>>> mean for parquet-java to return -18 when the logical type is
>>>>>>>>>>>> UINT8?
>>>>>>>>>>>>
>>>>>>>>>>>> Thinking a bit further, is this something we perhaps want to
>>>>>>>>>>>> specify? Something like:
>>>>>>>>>>>> - if the stored value is larger than the maximum allowed by the
>>>>>>>>>>>> annotation, only the lower N bits are taken into account
>>>>>>>>>>>> - if the stored value is smaller than the minimum allowed by the
>>>>>>>>>>>> annotation, the value is sign-extended if signed, otherwise
>>>>>>>>>>>> zero-extended
>>>>>>>>>>>>
>>>>>>>>>>>> On Sat, Feb 1, 2025 at 5:00 PM Andrew Lamb <andrewlam...@gmail.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> In my opinion, the more consistently they are handled, the
>>>>>>>>>>>>> better the ecosystem as a whole would be (we would waste less
>>>>>>>>>>>>> time chasing down seeming incompatibilities).
>>>>>>>>>>>>>
>>>>>>>>>>>>> Given its predominance in the ecosystem, I would personally
>>>>>>>>>>>>> suggest updating the other readers to follow the parquet-java
>>>>>>>>>>>>> implementation if practical.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>> Andrew
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Fri, Jan 31, 2025 at 7:09 PM Ed Seidl <etse...@live.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> An issue was recently raised [1] in arrow-rs questioning the
>>>>>>>>>>>>>> reading of a file that had improperly encoded UINT_8 and
>>>>>>>>>>>>>> UINT_16 columns. For instance, a UINT_8 value of 238 (0xee)
>>>>>>>>>>>>>> was plain encoded as 0xffffffee. When read by parquet-rs, a
>>>>>>>>>>>>>> value of null was returned. For the same file, parquet-java
>>>>>>>>>>>>>> (well, parquet-cli cat) returned -18, and arrow-cpp returned
>>>>>>>>>>>>>> 238.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> The Parquet specification [2] states that behavior in this
>>>>>>>>>>>>>> case is undefined, so all three readers are correct. I'm just
>>>>>>>>>>>>>> wondering if there is any desire in the community to suggest
>>>>>>>>>>>>>> handling such malformed data in a more consistent fashion, or
>>>>>>>>>>>>>> just leave UB as UB.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>> Ed
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> [1] https://github.com/apache/arrow-rs/issues/7040
>>>>>>>>>>>>>> [2] https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#unsigned-integers
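For reference, the divergent results described in the thread follow directly from how each reader interprets the stored 32-bit pattern. A quick sketch in plain Python (no Parquet library involved) of the two interpretations behind -18 and 238:

```python
import struct

# The writer plain-encoded the UINT_8 value 238 using the sign-extended
# 32-bit pattern 0xffffffee instead of 0x000000ee.
raw_bits = 0xFFFFFFEE

# Reinterpreting those bits as a signed 32-bit integer yields -18,
# matching what parquet-java / parquet-cli reports.
as_signed_int32 = struct.unpack("<i", struct.pack("<I", raw_bits))[0]

# Masking down to the 8 bits declared by the UINT_8 annotation yields
# 238, matching what arrow-cpp / pyarrow reports.
as_uint8 = raw_bits & 0xFF

print(as_signed_int32, as_uint8)  # -18 238
```

Parquet-rs's null is a third choice: treating the out-of-range value as unrepresentable rather than reinterpreting the bits. All three are defensible precisely because the spec leaves the behavior undefined.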