Hi Ed,

My concern with changing the spec is that existing writer implementations
have already produced Parquet files of the kind the change intends to
forbid. It would take a long time to deprecate the old writers, and in the
meantime any reader implementation should still be able to decode the
legacy files.

Best,
Gang

On Wed, Oct 25, 2023 at 12:52 AM Edward Seidl <etse...@live.com> wrote:

> Hi all,
>
> I ran into a small issue while implementing a DELTA_BINARY_PACKED encoder.
> The problem arises when the Parquet physical type is INT32, and the deltas
> exceed what a 32-bit integer can represent. The strategy used by most
> writers is to have two encoder implementations, one for 32-bit and one for
> 64-bit, so when the 32-bit encoder hits the above situation, it relies on
> well-defined overflow and goes about its business, resulting in a block of
> data where the encoding bit width is 32. My implementation instead always
> uses 64 bits for the encoding, and winds up using 33 bits rather than
> overflowing. Clearly this is not desirable behavior (delta encoding
> shouldn’t produce data larger than just plain encoding!), but I didn’t
> catch it early on because parquet-mr can actually read the data encoded
> with 33 bits just fine.
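>
> To make the two strategies concrete, here's a minimal sketch in Rust
> (illustrative only, not taken from any existing Parquet implementation;
> all names are mine) contrasting widened 64-bit deltas with wrapping
> 32-bit deltas for INT32 values:
>
>     fn main() {
>         let vals: [i32; 3] = [i32::MAX, i32::MIN, i32::MAX];
>
>         // Widening to i64 first: the deltas are -4294967295 and
>         // 4294967295. After subtracting the block's min delta, the
>         // largest offset is 2^33 - 2, forcing a 33-bit packing width.
>         let d64: Vec<i64> =
>             vals.windows(2).map(|w| w[1] as i64 - w[0] as i64).collect();
>         let min64 = *d64.iter().min().unwrap();
>         let max_off64 = (d64.iter().max().unwrap() - min64) as u64;
>         println!("widened: {} bits", 64 - max_off64.leading_zeros()); // 33
>
>         // Wrapping 32-bit subtraction: the deltas are 1 and -1
>         // (well-defined two's-complement overflow), so the packing
>         // width can never exceed 32 bits.
>         let d32: Vec<i32> =
>             vals.windows(2).map(|w| w[1].wrapping_sub(w[0])).collect();
>         let min32 = *d32.iter().min().unwrap();
>         let max_off32 = d32.iter().max().unwrap().wrapping_sub(min32) as u32;
>         println!("wrapping: {} bits", 32 - max_off32.leading_zeros()); // 2
>
>         // A decoder recovers the original values with the matching
>         // wrapping addition:
>         let mut cur = vals[0];
>         for d in &d32 {
>             cur = cur.wrapping_add(*d);
>         }
>         assert_eq!(cur, *vals.last().unwrap());
>     }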
>
> I plan on fixing my implementation, but I’m wondering if the Parquet
> specification should be modified to either a) forbid using more bits than
> the physical type, or b) add verbiage to the effect that writers should not
> use more bits than the physical type, but readers should be able to handle
> that.
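>
> For option (b), a lenient reader can cope with the wider-than-physical-type
> case by accumulating in 64 bits and truncating back to INT32. A rough
> sketch in Rust (hypothetical names; I'm not claiming this is how
> parquet-mr does it internally):
>
>     // `offsets` are the unpacked per-value offsets from one block,
>     // read using that block's bit width (possibly 33+ for INT32 data).
>     fn apply_deltas_lenient(first: i32, min_delta: i64, offsets: &[u64]) -> Vec<i32> {
>         let mut out = vec![first];
>         let mut cur = first as i64;
>         for &off in offsets {
>             cur = cur.wrapping_add(min_delta.wrapping_add(off as i64));
>             out.push(cur as i32); // truncate to the INT32 physical type
>         }
>         out
>     }
>
> Either way the truncation yields the right INT32 values, since wrapping
> addition commutes with truncation to 32 bits.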
>
> Thoughts?
>
> Thanks,
> Ed
>
> PS Sorry if this is a duplicate...I tried sending yesterday from a
> different email address but that didn't appear to get through.
>
