[ https://issues.apache.org/jira/browse/SPARK-43427?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Hyukjin Kwon reassigned SPARK-43427:
------------------------------------

Assignee: Parth Upadhyay

> Unsigned integer types are deserialized as signed numeric equivalents
> ---------------------------------------------------------------------
>
>                 Key: SPARK-43427
>                 URL: https://issues.apache.org/jira/browse/SPARK-43427
>             Project: Spark
>          Issue Type: Bug
>          Components: Protobuf
>    Affects Versions: 3.4.0
>            Reporter: Parth Upadhyay
>            Assignee: Parth Upadhyay
>            Priority: Major
>              Labels: pull-request-available
>
> I'm not sure whether "bug" is the correct tag for this Jira, but I've tagged
> it that way for now since the behavior seems odd; happy to update it to
> "improvement" or something else based on the conversation!
>
> h2. Issue
>
> Protobuf supports unsigned integer types, including `uint32` and `uint64`.
> When deserializing protobuf values with fields of these types, uint32 is
> converted to `IntegerType` and uint64 to `LongType` in the resulting Spark
> struct. `IntegerType` and `LongType` are
> [signed|https://spark.apache.org/docs/latest/sql-ref-datatypes.html] integer
> types, so this can lead to confusing results.
>
> Namely, if a uint32 value in a stored proto is above 2^31, or a uint64 value
> is above 2^63, its binary representation will have a 1 in the highest bit,
> which when interpreted as a signed integer comes out negative (i.e., it
> overflows).
>
> I propose that we deserialize unsigned integer types into types that can
> represent them correctly, e.g.
> uint32 => `LongType`
> uint64 => `Decimal(20, 0)`
>
> h2. Backwards Compatibility / Default Behavior
>
> Should we maintain backwards compatibility and add an option that allows
> deserializing these types differently? Or should we change the default
> behavior (with an option to go back to the old way)?
>
> I think by default it makes more sense to deserialize them as the larger
> types so that the result is semantically correct.
> However, there may be existing users of this library who would be affected
> by this behavior change. Then again, we can perhaps justify the change since
> the function is tagged as `Experimental` (and Spark 3.4.0 was only released
> very recently).
>
> h2. Precedent
>
> I believe that unsigned integer types in parquet are deserialized in a
> similar manner, i.e. put into a larger type so that the unsigned
> representation fits natively. See
> https://issues.apache.org/jira/browse/SPARK-34817 and
> https://github.com/apache/spark/pull/31921

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
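The overflow described in the issue can be sketched in a few lines of standalone Python (an illustration of the bit-level behavior only, not Spark or the protobuf connector itself; the helper names are made up for this sketch):

```python
import struct
from decimal import Decimal

def as_signed32(u: int) -> int:
    """Reinterpret a uint32 bit pattern as a signed 32-bit integer
    (what happens when a uint32 field lands in IntegerType)."""
    return struct.unpack("<i", struct.pack("<I", u))[0]

def as_signed64(u: int) -> int:
    """Reinterpret a uint64 bit pattern as a signed 64-bit integer
    (what happens when a uint64 field lands in LongType)."""
    return struct.unpack("<q", struct.pack("<Q", u))[0]

# A uint32 above 2^31 overflows when read as a signed 32-bit value:
print(as_signed32(2**31 + 5))   # -2147483643

# A signed 64-bit value (LongType) holds any uint32 exactly:
print(2**31 + 5)                # 2147483653 -- fits in LongType

# Likewise, a uint64 above 2^63 overflows as a signed 64-bit value:
print(as_signed64(2**63 + 5))   # -9223372036854775803

# Decimal(20, 0) covers the full uint64 range; the max value has 20 digits:
print(Decimal(2**64 - 1))       # 18446744073709551615
```

This is why the proposed widening (uint32 => `LongType`, uint64 => `Decimal(20, 0)`) is lossless: each target type's range strictly contains the corresponding unsigned range.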