rustyconover opened a new issue, #8242: URL: https://github.com/apache/iceberg/issues/8242
### Apache Iceberg version

1.3.1 (latest release)

### Query engine

Other

### Please describe the bug 🐞

Hi @Fokko,

When testing Avro integer parsing, I discovered that the value 2^63 is not decoded correctly. It is returned as -9223372036854775809 instead of 9223372036854775808.

Since integer decoding is used to plan scans in Python, a column containing a value this large may cause a scan to return incorrect results and miss data.

Looking at other references for ZigZag encoding around the web, specifically Google's protobuf, shows that the Python implementation of Protobuf doesn't decode 2^63 correctly either. Their tests only exercise the code up to 2**60.

https://github.com/protocolbuffers/protobuf/blob/5e03386555544e39c21236dca0097123edec8769/python/google/protobuf/internal/decoder.py#L124

Extracting Google's Protobuf code into a smaller test case:

```python
def _SignedVarintDecoder(bits, result_type):
    """Like _VarintDecoder() but decodes signed values."""
    signbit = 1 << (bits - 1)
    mask = (1 << bits) - 1

    def DecodeVarint(buffer, pos):
        result = 0
        shift = 0
        while 1:
            b = buffer[pos]
            result |= ((b & 0x7f) << shift)
            pos += 1
            if not (b & 0x80):
                result &= mask
                result = (result ^ signbit) - signbit
                result = result_type(result)
                return (result, pos)
            shift += 7
            if shift >= 64:
                raise ValueError('Too many bytes when decoding varint.')
    return DecodeVarint


_DecodeSignedVarint = _SignedVarintDecoder(64, int)
decoder = _SignedVarintDecoder(64, int)

# This is the 2**63 encoded as a zigzag encoded varint
decoded_value = _DecodeSignedVarint(b'\x81\x80\x80\x80\x80\x80\x80\x80\x80\x02', 0)[0]
assert decoded_value == 2**63, f"Decoded={decoded_value} != Expected={2**63}"
```

```
Traceback (most recent call last):
    assert decoded_value == 2**63, f"Decoded={decoded_value} != Expected={2**63}"
           ^^^^^^^^^^^^^^^^^^^^^^
AssertionError: Decoded=1 != Expected=9223372036854775808
```

Google's protobuf decodes the value as 1, which isn't great either.

Testing the Cython-based branch I'm working on decodes the value as:

AssertionError: Decoded value does not match decoded=-1 expected=9223372036854775808
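
For comparison, here is a minimal pure-Python sketch of zigzag varint encoding and decoding, with an explicit signed 64-bit range check on the decoded result. The names `zigzag_encode`, `zigzag_decode`, and `decode_long` are hypothetical helpers written for this illustration; they are not taken from PyIceberg or protobuf. Because Python integers are unbounded, the plain decoder never wraps at 64 bits, so a range check is one possible way to reject out-of-range input instead of returning it.

```python
INT64_MIN = -(2**63)
INT64_MAX = 2**63 - 1


def zigzag_encode(n: int) -> bytes:
    """Encode a signed integer as a zigzag varint (hypothetical helper)."""
    # Zigzag mapping: 0, -1, 1, -2, ... becomes 0, 1, 2, 3, ...
    z = (n << 1) if n >= 0 else ((-n << 1) - 1)
    out = bytearray()
    while True:
        byte = z & 0x7F
        z >>= 7
        if z:
            out.append(byte | 0x80)  # more bytes follow
        else:
            out.append(byte)  # final byte, continuation bit clear
            return bytes(out)


def zigzag_decode(buf: bytes) -> int:
    """Decode a zigzag varint with unbounded Python ints (no 64-bit wrap)."""
    z = 0
    shift = 0
    for b in buf:
        z |= (b & 0x7F) << shift
        shift += 7
        if not (b & 0x80):
            break
    # Undo the zigzag mapping.
    return (z >> 1) ^ -(z & 1)


def decode_long(buf: bytes) -> int:
    """Decode and reject anything outside the signed 64-bit range."""
    value = zigzag_decode(buf)
    if not INT64_MIN <= value <= INT64_MAX:
        raise ValueError(f"decoded value {value} does not fit in a signed 64-bit long")
    return value


if __name__ == "__main__":
    reported = b"\x81\x80\x80\x80\x80\x80\x80\x80\x80\x02"
    print(zigzag_decode(reported))
    print(decode_long(zigzag_encode(INT64_MAX)))
```

In this sketch, the ten-byte input from the report decodes to -9223372036854775809 with unbounded integers, matching the value reported above, while `decode_long` raises `ValueError` for it rather than returning a number that cannot fit in a long.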
