Hi Ed,

As [1] notes, DELTA_BINARY_PACKED might not be suitable for all cases, and [3] discusses the same problem. Since data encoded this way already exists in the wild, I think readers should stay compatible with it.
Also, [2] introduces some optimizations for DELTA_BINARY_PACKED. Besides, maybe we can introduce the PFor encoding mentioned in [1] in the future.

[1] https://arxiv.org/pdf/1209.2137v5.pdf
[2] https://github.com/apache/arrow-rs/issues/2282
[3] https://github.com/apache/arrow/issues/20374

Best,
Xuwei Fu

Edward Seidl <etse...@live.com> wrote on Wed, Oct 25, 2023 at 00:51:

> Hi all,
>
> I ran into a small issue while implementing a DELTA_BINARY_PACKED encoder.
> The problem arises when the Parquet physical type is INT32 and the deltas
> exceed what a 32-bit integer can represent. The strategy used by most
> writers is to have two encoder implementations, one for 32-bit and one for
> 64-bit, so when the 32-bit encoder hits the above situation, it relies on
> well-defined overflow and goes about its business, resulting in a block of
> data where the encoding bit width is 32. My implementation instead always
> uses 64 bits for the encoding, and winds up using 33 bits rather than
> overflowing. Clearly this is not desirable behavior (delta encoding
> shouldn't produce data larger than just plain encoding!), but I didn't
> catch it early on because parquet-mr can actually read the data encoded
> with 33 bits just fine.
>
> I plan on fixing my implementation, but I'm wondering if the Parquet
> specification should be modified to either a) forbid using more bits than
> the physical type, or b) add verbiage to the effect that writers should
> not use more bits than the physical type, but readers should be able to
> handle that.
>
> Thoughts?
>
> Thanks,
> Ed
>
> PS Sorry if this is a duplicate... I tried sending yesterday from a
> different email address but that didn't appear to get through.
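
To make the overflow behavior above concrete, here is a minimal Rust sketch. It is not the arrow-rs encoder and it ignores the min-delta subtraction and block layout of DELTA_BINARY_PACKED; it only demonstrates why the 32-bit wrapping strategy keeps the bit width at or below 32 while still round-tripping:

    // A sketch only, not any real encoder: it demonstrates the
    // wrapping-delta property, not the full DELTA_BINARY_PACKED format.
    fn main() {
        // Consecutive values whose true difference overflows i32.
        let values: [i32; 3] = [i32::MIN, i32::MAX, i32::MIN];

        // 32-bit strategy: well-defined two's-complement wrapping keeps
        // every delta representable in at most 32 bits.
        let wrapped: Vec<i32> = values
            .windows(2)
            .map(|w| w[1].wrapping_sub(w[0]))
            .collect();
        println!("wrapped deltas: {wrapped:?}"); // [-1, 1]

        // Widening to i64 first yields the true deltas, which need 33 bits.
        let widened: Vec<i64> = values
            .windows(2)
            .map(|w| w[1] as i64 - w[0] as i64)
            .collect();
        println!("widened deltas: {widened:?}"); // [4294967295, -4294967295]

        // A reader that decodes with the matching wrapping addition
        // recovers the original values, so the 32-bit scheme round-trips.
        let mut decoded = vec![values[0]];
        for d in &wrapped {
            decoded.push(decoded.last().unwrap().wrapping_add(*d));
        }
        assert_eq!(decoded, values);
    }

Widening to 64 bits first gives the mathematically true delta, which can need 33 bits; wrapping keeps everything in 32 bits, and the same wrapping addition on decode recovers the original values. This is why option b) (writers should not exceed the physical-type width, but readers should tolerate it) seems workable for existing data.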