Hi all, I ran into a small issue while implementing a DELTA_BINARY_PACKED encoder. The problem arises when the Parquet physical type is INT32 and the deltas exceed what a 32-bit integer can represent. The strategy used by most writers is to have two encoder implementations, one for 32-bit and one for 64-bit values, so when the 32-bit encoder hits this situation it relies on well-defined overflow and goes about its business, producing a block of data whose encoding bit width is 32. My implementation instead always uses 64-bit arithmetic for the encoding, and winds up using 33 bits rather than overflowing. Clearly this is not desirable behavior (delta encoding shouldn’t produce data larger than plain encoding!), but I didn’t catch it early on because parquet-mr can actually read the data encoded with 33 bits just fine.
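To make the arithmetic concrete, here is a minimal Python sketch of the two strategies on a single block. It is only an illustration: the function names (wrap_i32, encode_block, decode_block_i32) are hypothetical and not taken from any Parquet implementation, it assumes at least two input values, and it skips the real block/miniblock layout and bit packing, computing only the bit width that one block would need.

```python
INT32_MIN, INT32_MAX = -2**31, 2**31 - 1

def wrap_i32(x):
    """Reduce x to a signed 32-bit value, mimicking two's-complement wraparound."""
    return (x + 2**31) % 2**32 - 2**31

def encode_block(values, wrap32):
    """Return (min_delta, adjusted_deltas, bit_width) for one block.

    wrap32=True imitates an encoder doing 32-bit wrapping arithmetic;
    wrap32=False imitates one that always computes in wider integers.
    Assumes len(values) >= 2.
    """
    deltas = [b - a for a, b in zip(values, values[1:])]
    if wrap32:
        deltas = [wrap_i32(d) for d in deltas]
    min_delta = min(deltas)
    adjusted = [d - min_delta for d in deltas]
    if wrap32:
        # Treat each difference as an unsigned 32-bit quantity.
        adjusted = [a % 2**32 for a in adjusted]
    return min_delta, adjusted, max(adjusted).bit_length()

def decode_block_i32(first, min_delta, adjusted):
    """Rebuild INT32 values, truncating each running sum to 32 bits."""
    out = [first]
    for a in adjusted:
        out.append(wrap_i32(out[-1] + min_delta + a))
    return out

values = [0, INT32_MAX, INT32_MIN, 0]
min32, adj32, width32 = encode_block(values, wrap32=True)
min64, adj64, width64 = encode_block(values, wrap32=False)
print(width32, width64)
```

For this input the wrapping variant needs 32 bits and the wide variant needs 33, and in this sketch both decode back to the original values once the running sum is truncated to 32 bits.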

I plan on fixing my implementation, but I’m wondering if the Parquet specification should be modified to either a) forbid using more bits than the physical type, or b) add verbiage to the effect that writers should not use more bits than the physical type, but readers should be able to handle that.

Thoughts?

Thanks,
Ed

PS Sorry if this is a duplicate... I tried sending yesterday from a different email address but that didn't appear to get through.