[GitHub] [spark] justaparth commented on pull request #41108: [SPARK-43427] spark protobuf: modify serde behavior of unsigned integer types

via GitHub Sun, 21 May 2023 14:05:47 -0700


justaparth commented on PR #41108:
URL: https://github.com/apache/spark/pull/41108#issuecomment-1556292297


   > Where is the information loss or overflow? Java code generated by Protobuf 
for a uint32 field also returns an `int`, not `long`.
   
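   Sorry I didn't get a chance to reply to this until now. Technically there is 
no information loss, since uint32 is 4 bytes and uint64 is 8 bytes, the same as 
int and long respectively. However, the representation overflows: a uint32 value 
of 2^31 or more, or a uint64 value of 2^63 or more, comes out negative when it 
is read as a signed int or long.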
   
   Here's an example:
   
   Consider a protobuf message like:
   ```
   syntax = "proto3";
   
   message Test {
     uint64 val = 1;
   }
   ```
   
   Generate a serialized message with a value above 2^63. I did this in Python 
with a small script like:
   
   ```
   import test_pb2
   
   s = test_pb2.Test()
   s.val = 9223372036854775809 # 2**63 + 1
   serialized = s.SerializeToString()
   print(serialized)
   ```
   
   This generates the binary representation:
   
   ```
   b'\x08\x81\x80\x80\x80\x80\x80\x80\x80\x80\x01'
   ```
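As a sanity check on those bytes, here is a minimal sketch of a base-128 varint decoder in plain Python. The helper `decode_varint` is a hypothetical name written for this illustration and is not part of the protobuf library; it only handles the single varint field used in this example.

```python
def decode_varint(data, pos=0):
    """Decode one base-128 varint starting at data[pos]; return (value, next_pos)."""
    value = 0
    shift = 0
    while True:
        byte = data[pos]
        pos += 1
        value |= (byte & 0x7F) << shift  # the low 7 bits carry the payload
        if not byte & 0x80:              # a clear high bit marks the last byte
            return value, pos
        shift += 7


serialized = b'\x08\x81\x80\x80\x80\x80\x80\x80\x80\x80\x01'

# The first varint is the field tag: (field_number << 3) | wire_type.
tag, pos = decode_varint(serialized)
field_number, wire_type = tag >> 3, tag & 0x07

# The second varint is the uint64 payload itself.
val, pos = decode_varint(serialized, pos)

print(field_number, wire_type)  # 1 0
print(val == 2**63 + 1)         # True
```

So the wire format carries the full unsigned value; the sign only appears once the 64 bits are read into a signed type.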
   
   Then, deserialize this using `from_protobuf`. I did this in a notebook so 
it's easier to see, but it can be reproduced in a Scala test as well:
   
   <img width="597" alt="image" 
src="https://github.com/apache/spark/assets/1002986/a6c58c19-b9d3-44d4-8c2a-605991d3d5de">
   
   
   This is exactly what we'd expect when you take a 64-bit number with the 
highest bit set to `1` and then try to interpret it as a signed number (long): 
the value wraps around to a negative number.
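The same reinterpretation can be reproduced outside Spark. This is a small sketch using Python's standard `struct` module, not the code path `from_protobuf` takes: it packs the value as an unsigned 64-bit integer and reads the same 8 bytes back as a signed one.

```python
import struct

unsigned_val = 2**63 + 1

# Pack as an unsigned 64-bit integer, then unpack the same 8 bytes as signed.
raw = struct.pack("<Q", unsigned_val)
(signed_val,) = struct.unpack("<q", raw)

print(signed_val)  # -9223372036854775807
```

The bits are unchanged, which is why nothing is lost, but the number a user sees is `unsigned_val - 2**64`.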
   
   So this PR proposes some changes to the deserialization behavior. However, I 
don't know if it's right to change the default or to have an option that allows 
unpacking as a larger number.
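To make "unpacking as a larger number" concrete, here is a rough sketch of what such a widening could look like. The helper names and the target types (a wider integer for uint32, an arbitrary-precision decimal for uint64) are assumptions made for illustration and are not necessarily what this PR implements.

```python
from decimal import Decimal


def widen_uint32(signed_val):
    """Hypothetical helper: recover the unsigned value of a 32-bit pattern."""
    # Masking keeps the low 32 bits; the result fits in a 64-bit signed long.
    return signed_val & 0xFFFFFFFF


def widen_uint64(signed_val):
    """Hypothetical helper: recover the unsigned value of a 64-bit pattern."""
    # The result can exceed the signed 64-bit range, so a wider type is needed.
    return Decimal(signed_val & 0xFFFFFFFFFFFFFFFF)


print(widen_uint32(-1))                    # 4294967295
print(widen_uint64(-9223372036854775807))  # 9223372036854775809
```

The trade-off is that widening changes the column type users see, which is why making it the default versus an opt-in option is an open question.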
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

