Re: [PR] PARQUET-2435: Clarify behavior of DELTA_BINARY_PACKED encoding [parquet-format]

via GitHub Mon, 26 Feb 2024 03:04:54 -0800


pitrou commented on code in PR #231:
URL: https://github.com/apache/parquet-format/pull/231#discussion_r1502427845



##########
Encodings.md:
##########
@@ -247,6 +253,15 @@ and handled as wrapping around in 2's complement notation so that the original
 values are correctly restituted. This may require explicit care in some programming
 languages (for example by doing all arithmetic in the unsigned domain).
 
+One strategy that might be employed to avoid the above mentioned overflow is to
+perform the subtraction utilizing integers with a larger number of bits. For example,
+while encoding INT32 data one might choose to perform arithmetic operations using
+64-bit integers. This can lead to situations where the number of bits used to encode
+the resulting deltas is greater than the number of bits used to represent the input
+values. While this behavior is allowed, data produced in this manner may not be

Review Comment:
   I don't think that this behavior is (or should be) allowed. The spec should 
IMHO prescribe that INT32 is encoded at most using 32-bit deltas, and INT64 
using 64-bit deltas. Emitting deltas larger than the physical bitwidth should 
be considered a bug in the encoder.
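
To make the concern concrete, here is a minimal Python sketch. It is not taken from the PR or from any Parquet implementation: the helper names (`to_signed32`, `encode_block`, `decode_block`) are hypothetical, and the block handling is reduced to a single block with one minimum delta. It compares deltas computed in a wider integer type against deltas computed with 32-bit wrap-around for the INT32 sequence MIN, MAX, MIN.

```python
INT32_MIN = -(1 << 31)
INT32_MAX = (1 << 31) - 1
MASK32 = 0xFFFFFFFF


def to_signed32(x):
    """Reinterpret the low 32 bits of x as a 2's complement INT32."""
    x &= MASK32
    return x - (1 << 32) if x & 0x80000000 else x


def encode_block(values, wrap32):
    """Return (min_delta, adjusted_deltas) for one simplified block.

    wrap32=False mimics an encoder doing the subtraction in a wider type.
    wrap32=True mimics an encoder whose INT32 arithmetic wraps around.
    """
    deltas = [b - a for a, b in zip(values, values[1:])]
    if wrap32:
        deltas = [to_signed32(d) for d in deltas]
    min_delta = min(deltas)
    adjusted = [d - min_delta for d in deltas]
    if wrap32:
        adjusted = [a & MASK32 for a in adjusted]
    return min_delta, adjusted


def decode_block(first, min_delta, adjusted):
    """Rebuild INT32 values, wrapping every addition to 32 bits."""
    out = [first]
    for a in adjusted:
        out.append(to_signed32(out[-1] + min_delta + a))
    return out


values = [INT32_MIN, INT32_MAX, INT32_MIN]

wide_min, wide_adj = encode_block(values, wrap32=False)
wrap_min, wrap_adj = encode_block(values, wrap32=True)

# Bit width needed to pack the largest adjusted delta in each case.
wide_bits = max(wide_adj).bit_length()
wrap_bits = max(wrap_adj).bit_length()

print("wide arithmetic bit width:", wide_bits)
print("wrapping arithmetic bit width:", wrap_bits)
print("round trip ok:", decode_block(values[0], wrap_min, wrap_adj) == values)
```

In this sketch the wider arithmetic needs more than 32 bits per packed delta, while the wrapping variant stays within the physical width and still decodes to the original values.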



