Right, but the purpose of the specification is to say what
implementations should do, not necessarily what they currently do;
that way, implementations will converge over time.
There will always be non-spec-compliant implementations; the point of a
specification is to define what correct looks like. If implementations
do something different, that is a problem for those implementations to
fix, not for every other implementation to accommodate. Ambiguity in
the specification just forces every implementation to make value
judgements about what non-standardized behaviour it should choose to
have, which seems unfortunate.
Kind Regards,
Raphael
On 03/02/2025 10:16, Gang Wu wrote:
The problem is that we cannot control all the implementations in the wild.
Therefore we can only suggest that writers should not write this kind of
data, and that readers are free to return an error or anything else.
On Mon, Feb 3, 2025 at 6:05 PM Raphael Taylor-Davies
<r.taylordav...@googlemail.com.invalid> wrote:
The challenge with undefined behaviour is that it inherently leads to
incompatibilities, which in turn waste people's time chasing ghosts.
In this case I would suggest explicitly disallowing such data and
encouraging implementations to return an error. This is simple, avoids
adding complexity to the specification, and avoids the potential can of
worms of how to apply this masking logic to statistics, bloom filters, etc.
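To make that concrete, here is a rough sketch in Rust of the kind of
check I have in mind (the function name is mine, this is not parquet-rs
code):

    // Hypothetical reader-side validation for a UINT_8 column stored
    // as a physical int32 (sketch only).
    fn read_uint8(stored: i32) -> Result<u8, String> {
        u8::try_from(stored)
            .map_err(|_| format!("value {stored} is out of range for UINT_8"))
    }

    fn main() {
        assert_eq!(read_uint8(238), Ok(238));
        assert!(read_uint8(0xffffffee_u32 as i32).is_err()); // -18 rejected
    }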
I don't believe any real parquet writer will produce such data, and so I
think we should just take the simplest option of not allowing it.
Kind Regards,
Raphael
On 3 February 2025 09:25:42 GMT, Gang Wu <ust...@gmail.com> wrote:
I'm inclined to leave it as undefined behaviour from the spec's
perspective, to keep the spec simple.
On Mon, Feb 3, 2025 at 4:57 PM Alkis Evlogimenos
<alkis.evlogime...@databricks.com.invalid> wrote:
Wouldn't it be better to return 238 in this case? What does it mean for
parquet-java to return -18 when the logical type is UINT8?
Thinking a bit further, is this something we perhaps want to specify?
Something like:
- if the stored value is larger than the maximum allowed by the
annotation, only the lower N bits are taken into account
- if the stored value is smaller than the maximum allowed by the
annotation, the value is sign-extended if signed, otherwise zero-extended
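As a sketch of how those two rules could compose (Rust, names mine;
treating the annotation as an N-bit width plus a signedness flag):

    fn main() {
        // Proposed normalization for an N-bit integer annotation over
        // a 32-bit physical value (sketch of the rules above).
        fn normalize(stored: i32, bits: u32, signed: bool) -> i64 {
            // Keep only the lower N bits of the stored value.
            let low = (stored as u32) & (u32::MAX >> (32 - bits));
            if signed {
                // Sign-extend the N-bit value back to full width.
                let shift = 32 - bits;
                (((low << shift) as i32) >> shift) as i64
            } else {
                // Zero-extend the N-bit value.
                low as i64
            }
        }

        // 0xffffffee under UINT_8: lower 8 bits, zero-extended -> 238
        assert_eq!(normalize(0xffffffee_u32 as i32, 8, false), 238);
        // 0xffffffee under INT_8: lower 8 bits, sign-extended -> -18
        assert_eq!(normalize(0xffffffee_u32 as i32, 8, true), -18);
    }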
On Sat, Feb 1, 2025 at 5:00 PM Andrew Lamb <andrewlam...@gmail.com>
wrote:
In my opinion, the more consistently they are handled, the better the
ecosystem as a whole would be (we would waste less time chasing down
seeming incompatibilities).
Given its predominance in the ecosystem, I would personally suggest
updating the other readers to follow the parquet-java implementation if
practical.
Thanks,
Andrew
On Fri, Jan 31, 2025 at 7:09 PM Ed Seidl <etse...@live.com> wrote:
An issue was recently raised [1] in arrow-rs questioning the reading of
a file that had improperly encoded UINT_8 and UINT_16 columns. For
instance, a UINT_8 value of 238 (0xee) was plain-encoded as 0xffffffee.
When read by parquet-rs, a value of null was returned. For the same
file, parquet-java (well, parquet-cli cat) returned -18, and arrow-cpp
returned 238.
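Roughly, the three behaviours correspond to the following (a Rust
sketch of the interpretations, not the readers' actual code):

    fn main() {
        // The stored physical value: int32 0xffffffee, i.e. -18.
        let stored = 0xffffffee_u32 as i32;

        // parquet-java-like: the int32 passed through unchanged.
        assert_eq!(stored, -18);

        // arrow-cpp-like: only the low 8 bits kept.
        assert_eq!(stored as u8, 238);

        // parquet-rs-like: range check fails, yielding null.
        assert_eq!(u8::try_from(stored).ok(), None);
    }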
The Parquet specification [2] states that behavior in this case is
undefined, so all three readers are correct. I'm just wondering if
there is any desire in the community to suggest handling such malformed
data in a more consistent fashion, or just leave UB as UB.
Thanks,
Ed
[1] https://github.com/apache/arrow-rs/issues/7040
[2]
https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#unsigned-integers