Hi Ed,

My concern with changing the spec is that existing writer implementations have already produced Parquet files of the kind the change intends to forbid. It would take a long time to deprecate the old writers, and any reader implementation should always be able to decode legacy files.
Best,
Gang

On Wed, Oct 25, 2023 at 12:52 AM Edward Seidl <etse...@live.com> wrote:
> Hi all,
>
> I ran into a small issue while implementing a DELTA_BINARY_PACKED encoder.
> The problem arises when the parquet physical type is INT32 and the deltas
> exceed what a 32-bit integer can represent. The strategy used by most
> writers is to have two encoder implementations, one for 32-bit and one for
> 64-bit, so when the 32-bit encoder hits the above situation, it relies on
> well-defined overflow and goes about its business, resulting in a block of
> data where the encoding bit width is at most 32. My implementation instead
> always uses 64 bits for the encoding, and winds up using 33 bits rather
> than overflowing. Clearly this is not desirable behavior (delta encoding
> shouldn't produce data larger than plain encoding!), but I didn't catch it
> early on because parquet-mr can actually read the data encoded with 33
> bits just fine.
>
> I plan on fixing my implementation, but I'm wondering if the Parquet
> specification should be modified to either a) forbid using more bits than
> the physical type, or b) add verbiage to the effect that writers should
> not use more bits than the physical type, but readers should be able to
> handle that.
>
> Thoughts?
>
> Thanks,
> Ed
>
> PS Sorry if this is a duplicate...I tried sending yesterday from a
> different email address but that didn't appear to get through.
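For readers following along: the wrapping-overflow strategy Ed describes can be sketched roughly as below. This is an illustrative Python sketch (the helper names `wrapping_delta32` and `wrapping_add32` are hypothetical, not from any Parquet implementation); it shows why computing INT32 deltas in 32-bit wrapping arithmetic keeps the encoded bit width at or below 32, while the wrapping addition on the reader side still recovers the original values.

```python
INT32_MIN, INT32_MAX = -(1 << 31), (1 << 31) - 1

def wrapping_delta32(prev: int, cur: int) -> int:
    """Encoder side: (cur - prev) with well-defined 32-bit wraparound."""
    d = (cur - prev) & 0xFFFFFFFF      # reduce modulo 2^32
    return d - (1 << 32) if d >= (1 << 31) else d  # reinterpret as signed

def wrapping_add32(prev: int, delta: int) -> int:
    """Decoder side: prev + delta with the same 32-bit wraparound."""
    v = (prev + delta) & 0xFFFFFFFF
    return v - (1 << 32) if v >= (1 << 31) else v

# The true delta from INT32_MIN to INT32_MAX is 2^32 - 1, which needs 33
# bits; with wrapping arithmetic it becomes -1 and fits in 32 bits.
delta = wrapping_delta32(INT32_MIN, INT32_MAX)
print(delta)                                    # -1
print(wrapping_add32(INT32_MIN, delta))         # INT32_MAX, recovered exactly
```

The roundtrip works because encoder and decoder wrap modulo 2^32 consistently, which is the same guarantee native 32-bit integer overflow gives writers implemented in languages with well-defined wrapping arithmetic.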