justaparth commented on PR #41108:
URL: https://github.com/apache/spark/pull/41108#issuecomment-1556292297
> Where is the information loss or overflow? Java code generated by Protobuf
> for a uint32 field also returns an `int`, not `long`.
Sorry I didn't get a chance to reply to this until now. Technically there is no
information loss, as uint32 is 4 bytes and uint64 is 8 bytes, the same as int
and long respectively. However, the representation overflows: an unsigned value
with the highest bit set is read back as a negative signed number.
Here's an example:
Consider a protobuf message like:
```
syntax = "proto3";
message Test {
uint64 val = 1;
}
```
Generate a serialized message with a value above 2^63. I did this in Python
with a small script like:
```
import test_pb2
s = test_pb2.Test()
s.val = 9223372036854775809 # 2**63 + 1
serialized = s.SerializeToString()
print(serialized)
```
This generates the binary representation:
```
b'\x08\x81\x80\x80\x80\x80\x80\x80\x80\x80\x01'
```
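As a sanity check that these bytes really carry the full unsigned value, here is a minimal sketch of a hand-rolled varint decoder. The `decode_varint` helper is hypothetical and written only for illustration; it is not part of the protobuf library or of Spark.

```python
def decode_varint(data: bytes) -> int:
    """Decode one base-128 varint from the start of `data` (illustrative helper)."""
    result = 0
    for i, byte in enumerate(data):
        # The low 7 bits of each byte are payload, least significant group first.
        result |= (byte & 0x7F) << (7 * i)
        # A clear high bit marks the last byte of the varint.
        if not byte & 0x80:
            return result
    raise ValueError("truncated varint")


serialized = b'\x08\x81\x80\x80\x80\x80\x80\x80\x80\x80\x01'

# The first byte is the field tag: (field_number << 3) | wire_type.
tag = serialized[0]
field_number, wire_type = tag >> 3, tag & 0x07

value = decode_varint(serialized[1:])
print(field_number, wire_type, value)  # 1 0 9223372036854775809
```

So the wire format itself holds 2**63 + 1 intact; the sign only appears once the 64 bits are handed to a signed type.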
Then deserialize this using `from_protobuf`. I did this in a notebook so
it's easier to see, but it could be reproduced in a Scala test as well:
<img width="597" alt="image"
src="https://github.com/apache/spark/assets/1002986/a6c58c19-b9d3-44d4-8c2a-605991d3d5de">
This is exactly what we'd expect when you take a 64-bit number with the
highest bit set to `1` and then interpret it as a signed number (long).
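The same reinterpretation can be reproduced outside Spark. This is a small sketch using Python's `struct` module to pack the value as an unsigned 64-bit integer and read the identical bytes back as a signed one, which is what a JVM `long` would hold for these bits:

```python
import struct

unsigned_val = 2**63 + 1  # the value written by the script above

# Pack as an unsigned 64-bit integer, then read the same 8 bytes as signed.
raw = struct.pack("<Q", unsigned_val)
(as_signed,) = struct.unpack("<q", raw)

print(as_signed)  # -9223372036854775807
```

No bits are lost in the round trip; only the interpretation of the top bit changes.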
So this PR proposes some changes to the deserialization behavior. However, I
don't know whether it's right to change the default or to add an option that
allows unpacking as a larger type.
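To illustrate what "unpacking as a larger type" means arithmetically, here is a rough sketch. The `widen_unsigned` helper is hypothetical and is not the PR's implementation; it only shows that masking the signed value to its original bit width recovers the unsigned value once the target type is wide enough to hold it.

```python
def widen_unsigned(signed_value: int, bits: int) -> int:
    """Reinterpret a signed `bits`-wide integer as unsigned (illustrative helper)."""
    # Python ints are arbitrary precision, so the mask keeps exactly the
    # low `bits` bits of the two's-complement representation.
    return signed_value & ((1 << bits) - 1)


# uint64 read as a long, then widened back
print(widen_unsigned(-9223372036854775807, 64))  # 9223372036854775809

# uint32 read as an int, then widened back
print(widen_unsigned(-1, 32))  # 4294967295
```

Values without the high bit set pass through unchanged, so widening only affects the inputs that currently come out negative.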