rustyconover opened a new issue, #8242:
URL: https://github.com/apache/iceberg/issues/8242

   ### Apache Iceberg version
   
   1.3.1 (latest release)
   
   ### Query engine
   
   Other
   
   ### Please describe the bug 🐞
   
   Hi @Fokko, 
   
   When testing Avro integer parsing I discovered that the value 2^63 would 
not be correctly decoded.  The 2^63 value would be returned as 
-9223372036854775809 instead of 9223372036854775808.
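
One way to see where the negative number can come from is a minimal sketch that assumes a decoder built on Python's unbounded integers with no 64-bit range check. The function name `decode_zigzag_varint` is hypothetical and is not PyIceberg's API. Fed the same byte string used in the test case further down, it yields the value reported above:

```python
from typing import Tuple


def decode_zigzag_varint(buf: bytes, pos: int = 0) -> Tuple[int, int]:
    """Decode one zigzag varint with no width limit; return (value, next_pos)."""
    raw = 0
    shift = 0
    while True:
        b = buf[pos]
        pos += 1
        # Each byte contributes its low 7 bits, least significant group first.
        raw |= (b & 0x7F) << shift
        if not (b & 0x80):  # high bit clear marks the last byte
            break
        shift += 7
    # Undo zigzag: the low bit carries the sign, the other bits the magnitude.
    return (raw >> 1) ^ -(raw & 1), pos


payload = b"\x81\x80\x80\x80\x80\x80\x80\x80\x80\x02"
value, end = decode_zigzag_varint(payload)
print(value)  # -9223372036854775809
```

The raw varint here is 2**64 + 1, which needs 65 bits, so nothing in this sketch stops the result from leaving the signed 64-bit range.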
   
   Since integer decoding is used to plan scans in Python, a column that 
contains a value this large means that a scan may return incorrect results 
and miss data.
   
   Looking at other references for ZigZag encoding around the web, specifically 
Google's protobuf, shows that the Python implementation of Protobuf doesn't 
decode 2^63 correctly either.  Their tests only exercise the code up to 2**60.
   
   
https://github.com/protocolbuffers/protobuf/blob/5e03386555544e39c21236dca0097123edec8769/python/google/protobuf/internal/decoder.py#L124
   
   Extracting Google's Protobuf decoder into a smaller test case:
   
   ```python
   def _SignedVarintDecoder(bits, result_type):
     """Like _VarintDecoder() but decodes signed values."""
   
     signbit = 1 << (bits - 1)
     mask = (1 << bits) - 1
   
     def DecodeVarint(buffer, pos):
       result = 0
       shift = 0
       while 1:
         b = buffer[pos]
         result |= ((b & 0x7f) << shift)
         pos += 1
         if not (b & 0x80):
           result &= mask
           result = (result ^ signbit) - signbit
           result = result_type(result)
           return (result, pos)
         shift += 7
         if shift >= 64:
           raise ValueError('Too many bytes when decoding varint.')
     return DecodeVarint
   
   _DecodeSignedVarint = _SignedVarintDecoder(64, int)
   
   decoder = _SignedVarintDecoder(64, int)
   
   # This is 2**63 encoded as a zigzag-encoded varint
   decoded_value = _DecodeSignedVarint(b'\x81\x80\x80\x80\x80\x80\x80\x80\x80\x02', 0)[0]
   assert decoded_value == 2**63, f"Decoded={decoded_value} != Expected={2**63}"
   ```
   
   ```
   Traceback (most recent call last):
       assert decoded_value == 2**63, f"Decoded={decoded_value} != Expected={2**63}"
              ^^^^^^^^^^^^^^^^^^^^^^
   AssertionError: Decoded=1 != Expected=9223372036854775808
   ```
   
   Google's protobuf decodes the value as 1, which isn't great either.
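
The result of 1 can be traced by hand. The sketch below repeats the arithmetic of the extracted decoder step by step on the same byte string; the variable names are illustrative only. The accumulated varint is 2**64 + 1, the 64-bit mask drops bit 64, and the sign-extension step leaves the remaining 1 unchanged:

```python
payload = b"\x81\x80\x80\x80\x80\x80\x80\x80\x80\x02"

# Accumulate the 7-bit groups exactly as the decoder loop does.
raw = 0
for i, byte in enumerate(payload):
    raw |= (byte & 0x7F) << (7 * i)

mask = (1 << 64) - 1
signbit = 1 << 63

masked = raw & mask                    # bit 64 is discarded here
result = (masked ^ signbit) - signbit  # two's-complement sign extension

print(raw == 2**64 + 1)  # True
print(masked)            # 1
print(result)            # 1
```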
   
   The Cython-based branch I'm working on decodes the value as:
   
   AssertionError: Decoded value does not match decoded=-1 expected=9223372036854775808
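
One possible way to avoid silently wrong values is to reject any varint that does not fit in 64 bits. The following is a minimal sketch under that assumption, not the fix adopted in PyIceberg or Protobuf; `encode_zigzag_long` and `decode_zigzag_long` are hypothetical names. It round-trips the signed 64-bit boundary values and raises on the byte string from the test case above:

```python
from typing import Tuple

INT64_MIN = -(1 << 63)
INT64_MAX = (1 << 63) - 1


def encode_zigzag_long(value: int) -> bytes:
    """Zigzag-encode a signed 64-bit value as a varint."""
    if not INT64_MIN <= value <= INT64_MAX:
        raise ValueError(f"{value} does not fit in a signed 64-bit long")
    raw = (value << 1) ^ (value >> 63)  # non-negative for in-range input
    out = bytearray()
    while raw >= 0x80:
        out.append((raw & 0x7F) | 0x80)  # set the continuation bit
        raw >>= 7
    out.append(raw)
    return bytes(out)


def decode_zigzag_long(buf: bytes, pos: int = 0) -> Tuple[int, int]:
    """Decode a zigzag varint, rejecting anything wider than 64 bits."""
    raw = 0
    shift = 0
    while True:
        b = buf[pos]
        pos += 1
        raw |= (b & 0x7F) << shift
        if not (b & 0x80):
            break
        shift += 7
        if shift >= 70:  # an 11th byte would follow
            raise ValueError("varint is longer than 10 bytes")
    if raw >> 64:  # the 10th byte may carry bits beyond bit 63
        raise ValueError("varint does not fit in 64 bits")
    return (raw >> 1) ^ -(raw & 1), pos


for boundary in (INT64_MIN, -1, 0, 1, INT64_MAX):
    assert decode_zigzag_long(encode_zigzag_long(boundary))[0] == boundary
```

With this check, the 10-byte input `b'\x81\x80\x80\x80\x80\x80\x80\x80\x80\x02'` raises `ValueError` instead of decoding to 1, -1, or -9223372036854775809.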
   

