Right, but the purpose of the specification is to say what
implementations should do, not necessarily what they currently do;
that way, implementations will converge over time.
There will always be non-spec-compliant implementations; the point of a
specification is to define what correct looks like. If implementations
do something different, that is a problem for those implementations to
fix, not for every other implementation to accommodate. Ambiguity in
the specification just forces every implementation to make value
judgements about what non-standardized behaviour it should choose to
have, which seems unfortunate.
Kind Regards,
Raphael
On 03/02/2025 10:16, Gang Wu wrote:
The problem is that we cannot control all the implementations in the wild.
Therefore we can only suggest that writers should not write this kind of
data, and that readers are free to return an error or anything else.
On Mon, Feb 3, 2025 at 6:05 PM Raphael Taylor-Davies
<r.taylordav...@googlemail.com.invalid> wrote:
The challenge with undefined behaviour is that it inherently leads to
incompatibilities, which in turn waste people's time chasing ghosts.
In this case I would suggest explicitly disallowing such data and
encouraging implementations to return an error. This is simple, avoids
adding complexity to the specification, and avoids the potential can of
worms of how to apply this masking logic to statistics, bloom filters, etc.
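To make that concrete, here is a rough sketch in Rust of the kind of
check I have in mind (the function name is mine, this is not parquet-rs
code):

    // Hypothetical reader-side validation for a UINT_8 column stored
    // as a physical int32 (sketch only).
    fn read_uint8(stored: i32) -> Result<u8, String> {
        u8::try_from(stored)
            .map_err(|_| format!("value {stored} is out of range for UINT_8"))
    }

    fn main() {
        assert_eq!(read_uint8(238), Ok(238));
        assert!(read_uint8(0xffffffee_u32 as i32).is_err()); // -18 rejected
    }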
I don't believe any real parquet writer will produce such data, and so I
think we should just take the simplest option of not allowing it.
Kind Regards,
Raphael
On 3 February 2025 09:25:42 GMT, Gang Wu <ust...@gmail.com> wrote:
I'm inclined to leave it as undefined behaviour from the spec's
perspective, to keep the spec simple.
On Mon, Feb 3, 2025 at 4:57 PM Alkis Evlogimenos
<alkis.evlogime...@databricks.com.invalid> wrote:
Wouldn't it be better to return 238 in this case? What does it mean for
parquet-java to return -18 when the logical type is UINT8?
Thinking a bit further, is this something we perhaps want to specify?
Something like:
- if the stored value is larger than the maximum allowed by the
annotation, only the lower N bits are taken into account
- if the stored value is smaller than the maximum allowed by the
annotation, the value is sign-extended if signed, otherwise zero-extended
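As a sketch of how those two rules could compose (Rust, names mine;
treating the annotation as an N-bit width plus a signedness flag):

    fn main() {
        // Proposed normalization for an N-bit integer annotation over
        // a 32-bit physical value (sketch of the rules above).
        fn normalize(stored: i32, bits: u32, signed: bool) -> i64 {
            // Keep only the lower N bits of the stored value.
            let low = (stored as u32) & (u32::MAX >> (32 - bits));
            if signed {
                // Sign-extend the N-bit value back to full width.
                let shift = 32 - bits;
                (((low << shift) as i32) >> shift) as i64
            } else {
                // Zero-extend the N-bit value.
                low as i64
            }
        }

        // 0xffffffee under UINT_8: lower 8 bits, zero-extended -> 238
        assert_eq!(normalize(0xffffffee_u32 as i32, 8, false), 238);
        // 0xffffffee under INT_8: lower 8 bits, sign-extended -> -18
        assert_eq!(normalize(0xffffffee_u32 as i32, 8, true), -18);
    }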
On Sat, Feb 1, 2025 at 5:00 PM Andrew Lamb <andrewlam...@gmail.com>
wrote:
In my opinion, the more consistently they are handled, the better the
ecosystem as a whole would be (we would waste less time chasing down
seeming incompatibilities).
Given its predominance in the ecosystem, I would personally suggest
updating the other readers to follow the parquet-java implementation if
practical.
Thanks,
Andrew
On Fri, Jan 31, 2025 at 7:09 PM Ed Seidl <etse...@live.com> wrote:
An issue was recently raised [1] in arrow-rs questioning the reading of
a file that had improperly encoded UINT_8 and UINT_16 columns. For
instance, a UINT_8 value of 238 (0xee) was plain-encoded as 0xffffffee.
When read by parquet-rs, a value of null was returned. For the same
file, parquet-java (well, parquet-cli cat) returned -18, and arrow-cpp
returned 238.
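Roughly, the three behaviours correspond to the following (a Rust
sketch of the interpretations, not the readers' actual code):

    fn main() {
        // The stored physical value: int32 0xffffffee, i.e. -18.
        let stored = 0xffffffee_u32 as i32;

        // parquet-java-like: the int32 passed through unchanged.
        assert_eq!(stored, -18);

        // arrow-cpp-like: only the low 8 bits kept.
        assert_eq!(stored as u8, 238);

        // parquet-rs-like: range check fails, yielding null.
        assert_eq!(u8::try_from(stored).ok(), None);
    }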
The Parquet specification [2] states that behavior in this case is
undefined, so all three readers are correct. I'm just wondering if
there is any desire in the community to suggest handling such malformed
data in a more consistent fashion, or just leave UB as UB.
Thanks,
Ed
[1] https://github.com/apache/arrow-rs/issues/7040
[2]
https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#unsigned-integers