[ 
https://issues.apache.org/jira/browse/SPARK-43427?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-43427:
------------------------------------

    Assignee: Parth Upadhyay

> Unsigned integer types are deserialized as signed numeric equivalents
> ---------------------------------------------------------------------
>
>                 Key: SPARK-43427
>                 URL: https://issues.apache.org/jira/browse/SPARK-43427
>             Project: Spark
>          Issue Type: Bug
>          Components: Protobuf
>    Affects Versions: 3.4.0
>            Reporter: Parth Upadhyay
>            Assignee: Parth Upadhyay
>            Priority: Major
>              Labels: pull-request-available
>
> I'm not sure if "bug" is the correct tag for this Jira, but I've tagged it 
> that way for now since the behavior seems odd; happy to update it to 
> "improvement" or something else based on the conversation!
> h2. Issue
> Protobuf supports unsigned integer types, including `uint32` and `uint64`. 
> When deserializing Protobuf values with fields of these types, `uint32` is 
> converted to `IntegerType` and `uint64` is converted to `LongType` in the 
> resulting Spark struct. `IntegerType` and `LongType` are 
> [signed|https://spark.apache.org/docs/latest/sql-ref-datatypes.html] integer 
> types, so this can lead to confusing results.
> Namely, if a `uint32` value in a stored proto is above 2^31, or a `uint64` 
> value is above 2^63, its binary representation has the highest bit set, 
> which, when interpreted as a signed integer, comes out negative (i.e. it 
> overflows).
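> For illustration, here is a minimal standalone Scala sketch (not taken from 
> the connector) of how a `uint32` value above 2^31 turns negative when held 
> in a signed 32-bit type:
> {code:scala}
> // A value that fits in uint32 (max 4294967295) but not in a signed Int.
> val unsigned: Long = 3000000000L
>
> // Truncating to a signed 32-bit Int reinterprets the high bit as the sign bit.
> val asSignedInt: Int = unsigned.toInt  // -1294967296
>
> // Widening back to a Long recovers the original unsigned value.
> val recovered: Long = Integer.toUnsignedLong(asSignedInt)  // 3000000000
> {code}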
> I propose that we deserialize unsigned integer types into types that can 
> represent them correctly, e.g.
> uint32 => `LongType`
> uint64 => `Decimal(20, 0)`
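> A sketch of the widening conversions that mapping would imply (the helper 
> names here are hypothetical, not the connector's actual code):
> {code:scala}
> // uint32: widen the raw signed Int from the wire into a LongType-sized value.
> def uint32ToLong(raw: Int): Long = Integer.toUnsignedLong(raw)
>
> // uint64: a Long cannot represent values >= 2^63, so widen into a
> // Decimal(20, 0)-sized value; uint64's max (18446744073709551615) is 20 digits.
> def uint64ToDecimal(raw: Long): java.math.BigDecimal =
>   new java.math.BigDecimal(java.lang.Long.toUnsignedString(raw))
> {code}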
> h2. Backwards Compatibility / Default Behavior
> Should we maintain backwards compatibility and add an option that allows 
> deserializing these types differently? Or should we change the default 
> behavior (with an option to fall back to the old way)?
> I think it makes more sense to deserialize them as the larger types by 
> default, since that is semantically more correct. However, there may be 
> existing users of this library who would be affected by this behavior 
> change. Then again, maybe we can justify the change since the function is 
> tagged as `Experimental` (and Spark 3.4.0 was only released very recently).
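> If we go the option route, usage might look roughly like this (the option 
> name is hypothetical and would be settled in a PR; `df` is assumed to be a 
> DataFrame with a binary `payload` column):
> {code:scala}
> import org.apache.spark.sql.protobuf.functions.from_protobuf
>
> // Hypothetical option gating the new widening behavior.
> val options = new java.util.HashMap[String, String]()
> options.put("deserialize.unsigned.as.wider.types", "true")
>
> val parsed = df.select(
>   from_protobuf(df("payload"), "MyMessage", "/path/to/descriptor.desc", options)
>     .as("msg"))
> {code}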
> h2. Precedent
> I believe that unsigned integer types in Parquet are deserialized in a 
> similar manner, i.e. put into a larger type so that the unsigned 
> representation natively fits. See 
> https://issues.apache.org/jira/browse/SPARK-34817 and 
> https://github.com/apache/spark/pull/31921


