I'm inclined to leave this as undefined behavior in the spec, to keep the spec simple.

On Mon, Feb 3, 2025 at 4:57 PM Alkis Evlogimenos <alkis.evlogime...@databricks.com.invalid> wrote:

> Wouldn't it be better to return 238 in this case? What does it mean for
> parquet-java to return -18 when the logical type is UINT8?
>
> Thinking a bit further, is this something we perhaps want to specify?
> Something like:
> - if the stored value is larger than the maximum allowed by the annotation
>   only the lower N bits are taken into account
> - if the stored value is smaller than the maximum allowed by the annotation
>   the value is sign-extended if signed otherwise zero-extended
>
>
> On Sat, Feb 1, 2025 at 5:00 PM Andrew Lamb <andrewlam...@gmail.com> wrote:
>
> > In my opinion, the more consistently they are handled the better the
> > ecosystem as a whole would be (we would waste less time chasing down
> > seeming incompatibilities)
> >
> > Given its predominance in the ecosystem I would personally suggest updating
> > the other readers to follow the parquet-java implementation if practical.
> >
> > Thanks,
> > Andrew
> >
> > On Fri, Jan 31, 2025 at 7:09 PM Ed Seidl <etse...@live.com> wrote:
> >
> > > An issue was recently raised [1] in arrow-rs questioning the reading of a
> > > file that had improperly encoded UINT_8 and UINT_16 columns. For instance,
> > > a UINT_8 value of 238 (0xee) was plain encoded as 0xffffffee. When read by
> > > parquet-rs, a value of null was returned. For the same file, parquet-java
> > > (well, parquet-cli cat) returned -18, and arrow-cpp returned 238.
> > >
> > > The Parquet specification [2] states that behavior in this case is
> > > undefined, so all three readers are correct. I'm just wondering if there is
> > > any desire in the community to suggest handling such malformed data in a
> > > more consistent fashion, or just leave UB as UB.
> > >
> > > Thanks,
> > > Ed
> > >
> > > [1] https://github.com/apache/arrow-rs/issues/7040
> > > [2] https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#unsigned-integers
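
For anyone skimming the thread, here is a rough Python sketch of how the same stored bytes can produce the three results reported above (238, -18, null), plus the "lower N bits, then sign- or zero-extend" rule floated by Alkis. All function names are hypothetical and invented for illustration. This is not the actual code of arrow-cpp, parquet-java, or parquet-rs; it only reproduces the outputs quoted in the thread, assuming the value is stored as a PLAIN-encoded little-endian INT32.

```python
import struct
from typing import Optional


def decode_plain_int32(raw: bytes) -> int:
    """Decode one PLAIN-encoded INT32: 4 bytes, little-endian, signed."""
    return struct.unpack("<i", raw)[0]


def read_low_bits(stored: int, bits: int) -> int:
    """Keep only the lower `bits` bits (zero-extended)."""
    return stored & ((1 << bits) - 1)


def read_physical(stored: int) -> int:
    """Ignore the annotation and return the signed physical value."""
    return stored


def read_strict(stored: int, bits: int) -> Optional[int]:
    """Return None (null) when the value is outside [0, 2**bits - 1]."""
    return stored if 0 <= stored < (1 << bits) else None


def normalize(stored: int, bits: int, signed: bool) -> int:
    """Sketch of the proposed rule: take the lower `bits` bits, then
    sign-extend for signed annotations or zero-extend for unsigned ones."""
    low = stored & ((1 << bits) - 1)
    if signed and low >= 1 << (bits - 1):
        low -= 1 << bits
    return low


# 0xffffffee laid out as little-endian bytes, as in the malformed file.
stored = decode_plain_int32(bytes([0xEE, 0xFF, 0xFF, 0xFF]))

print(read_low_bits(stored, 8))     # 238, the value reported for arrow-cpp
print(read_physical(stored))        # -18, the value reported for parquet-cli cat
print(read_strict(stored, 8))       # None, i.e. the null reported for parquet-rs
print(normalize(stored, 8, False))  # 238 under the proposed rule for UINT_8
```

Under this sketch, the proposed rule would give 238 for UINT_8 and -18 for INT_8 from the same stored bytes.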