Hi all,

I ran into a small issue while implementing a DELTA_BINARY_PACKED encoder. The 
problem arises when the Parquet physical type is INT32 and the deltas exceed 
what a 32-bit integer can represent. The strategy used by most writers is to 
have two encoder implementations, one for 32-bit and one for 64-bit values, so 
when the 32-bit encoder hits the above situation, it relies on well-defined 
overflow and goes about its business, resulting in a block of data where the 
encoding bit width is 32. My implementation instead always uses 64-bit 
arithmetic for the encoding, and winds up using 33 bits rather than 
overflowing. Clearly this is not desirable behavior (delta encoding shouldn’t 
produce data larger than just plain encoding!), but I didn’t catch it early on 
because parquet-mr can actually read the data encoded with 33 bits just fine.
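
To make the difference concrete, here is a minimal Python sketch of the two 
strategies. It is not taken from parquet-mr or any other writer: the function 
names (wrap_int32, encode_block, decode_block) are hypothetical, and it ignores 
the real block/miniblock layout and headers. It models only the part that 
matters here: compute the deltas, subtract the minimum delta, and see how many 
bits the largest adjusted delta needs.

```python
INT32_MIN, INT32_MAX = -2**31, 2**31 - 1


def wrap_int32(x):
    """Reduce x to a signed 32-bit value, emulating two's-complement overflow."""
    return (x + 2**31) % 2**32 - 2**31


def encode_block(values, wrap32):
    """Return (first_value, min_delta, adjusted_deltas, bit_width) for one block.

    wrap32=True models a 32-bit encoder that lets deltas overflow;
    wrap32=False models an encoder that does all arithmetic in 64 bits.
    """
    deltas = [b - a for a, b in zip(values, values[1:])]
    if wrap32:
        deltas = [wrap_int32(d) for d in deltas]
    min_delta = min(deltas)
    adjusted = [d - min_delta for d in deltas]
    if wrap32:
        # The adjusted deltas are stored as unsigned 32-bit quantities.
        adjusted = [a % 2**32 for a in adjusted]
    bit_width = max(a.bit_length() for a in adjusted)
    return values[0], min_delta, adjusted, bit_width


def decode_block(first, min_delta, adjusted, wrap32):
    """Rebuild the values; with wrap32 the running sum overflows like int32."""
    out = [first]
    for a in adjusted:
        v = out[-1] + a + min_delta
        out.append(wrap_int32(v) if wrap32 else v)
    return out


values = [INT32_MIN, INT32_MAX, 0, INT32_MAX]

first, min_delta, adjusted, width64 = encode_block(values, wrap32=False)
roundtrip64 = decode_block(first, min_delta, adjusted, wrap32=False)

first, min_delta, adjusted, width32 = encode_block(values, wrap32=True)
roundtrip32 = decode_block(first, min_delta, adjusted, wrap32=True)

print("64-bit arithmetic bit width:", width64)
print("32-bit wrapping bit width:  ", width32)
print("both round-trip:", roundtrip64 == values and roundtrip32 == values)
```

For this input the 64-bit path needs 33 bits per delta while the wrapping path 
needs 32, and both decode back to the original values, which is consistent 
with a reader accepting either form.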



I plan on fixing my implementation, but I’m wondering if the Parquet 
specification should be modified to either a) forbid using more bits than the 
physical type, or b) add verbiage to the effect that writers should not use 
more bits than the physical type, but readers should be able to handle that.



Thoughts?



Thanks,

Ed


PS Sorry if this is a duplicate...I tried sending yesterday from a different 
email address but that didn't appear to get through.
