Hi Ed,

As [1] points out, DELTA_BINARY_PACKED might not be suitable
for all cases, and [3] discusses the same problem. Since data
encoded this way already exists in the wild, I think readers
should remain compatible with it.

Also, [2] describes some optimizations for DELTA_BINARY_PACKED.
Beyond that, maybe we can introduce the PFor encoding mentioned in [1]
in the future.
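
The two-encoder strategy you describe is easiest to see with a small
example. Here is a minimal sketch in Rust (hypothetical code, not taken
from the arrow-rs encoder) showing how wrapping 32-bit subtraction keeps
every delta representable in 32 bits, while a 64-bit subtraction of the
same INT32 values needs 33 bits:

// Minimal sketch: why wrapping 32-bit delta arithmetic bounds the
// DELTA_BINARY_PACKED bit width at 32 for INT32 columns.
fn main() {
    // Worst case: consecutive values jump between i32::MIN and i32::MAX.
    let values: [i32; 3] = [i32::MIN, i32::MAX, i32::MIN];

    // Computed in 64 bits, the true differences need 33 bits, which is
    // what an always-64-bit encoder ends up emitting.
    let wide: Vec<i64> = values
        .windows(2)
        .map(|w| w[1] as i64 - w[0] as i64)
        .collect();
    println!("64-bit deltas: {wide:?}"); // [4294967295, -4294967295]

    // With well-defined wrapping, the same differences are reduced
    // modulo 2^32, so the encoded width never exceeds 32 bits.
    let wrapped: Vec<i32> = values
        .windows(2)
        .map(|w| w[1].wrapping_sub(w[0]))
        .collect();
    println!("wrapped deltas: {wrapped:?}"); // [-1, 1]

    // A 32-bit decoder recovers the original values by wrapping back.
    let mut cur = values[0];
    for (delta, expect) in wrapped.iter().zip(&values[1..]) {
        cur = cur.wrapping_add(*delta);
        assert_eq!(cur, *expect);
    }
}

The round-trip check at the end is the key point: a decoder that also
wraps recovers the original values exactly, which is why relying on
well-defined overflow works for existing writers.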

[1] https://arxiv.org/pdf/1209.2137v5.pdf
[2] https://github.com/apache/arrow-rs/issues/2282
[3] https://github.com/apache/arrow/issues/20374

Best,
Xuwei Fu

Edward Seidl <etse...@live.com> wrote on Wed, Oct 25, 2023 at 00:51:

> Hi all,
>
> I ran into a small issue while implementing a DELTA_BINARY_PACKED encoder.
> The problem arises when the parquet physical type is INT32, and the deltas
> exceed what a 32-bit integer can represent. The strategy used by most
> writers is to have two encoder implementations, one for 32-bit and one for
> 64-bit, so when the 32-bit encoder hits the above situation, it relies on
> well-defined overflow and goes about its business, resulting in a block of
> data where the encoding bit width is 32. My implementation instead always
> uses 64 bits for the encoding, and winds up using 33 bits rather than
> overflowing. Clearly this is not desirable behavior (delta encoding
> shouldn’t produce data larger than just plain encoding!), but I didn’t
> catch it early on because parquet-mr can actually read the data encoded
> with 33 bits just fine.
>
>
>
> I plan on fixing my implementation, but I’m wondering if the Parquet
> specification should be modified to either a) forbid using more bits than
> the physical type, or b) add verbiage to the effect that writers should not
> use more bits than the physical type, but readers should be able to handle
> that.
>
>
>
> Thoughts?
>
>
>
> Thanks,
>
> Ed
>
>
> PS Sorry if this is a duplicate...I tried sending yesterday from a
> different email address but that didn't appear to get through.
>
