[ https://issues.apache.org/jira/browse/SPARK-53347?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Gurpal SINGH updated SPARK-53347:
---------------------------------
    Labels: Correctness Deserialize boolean correctness data-loss false from_protobuf null protobuf spark  (was: Deserialize boolean false from_protobuf null protobuf spark)

> Spark from_protobuf() incorrectly deserializes "false" boolean values as null
> -----------------------------------------------------------------------------
>
>                 Key: SPARK-53347
>                 URL: https://issues.apache.org/jira/browse/SPARK-53347
>             Project: Spark
>          Issue Type: Bug
>          Components: Protobuf
>    Affects Versions: 3.5.0, 3.5.1, 3.5.2, 4.0.0
>         Environment: Scala 2.13,
> Spark 3.5.2
> JDK 17
> Maven 3.9.11
>            Reporter: Gurpal SINGH
>            Priority: Major
>              Labels: Correctness, Deserialize, boolean, correctness, data-loss, false, from_protobuf, null, protobuf, spark
>   Original Estimate: 96h
>  Remaining Estimate: 96h
>
> *Problem*
> When a Protobuf message is deserialized with `from_protobuf()` in Spark,
> boolean fields whose value is `false` are incorrectly deserialized as `null`.
> This puts incorrect data into the resulting DataFrame and breaks semantic
> expectations.
>
> *Reproduction*
> Given a Protobuf message like the following (using
> `google.protobuf.BoolValue`):
>
> {code:java}
> syntax = "proto3";
> import "google/protobuf/wrappers.proto";
> message Example {
>   google.protobuf.BoolValue is_active = 1;
> }{code}
>
> For a message in which is_active is explicitly set to false,
> _from_protobuf()_ returns null instead of false.
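To make the reproduction input concrete, here is a minimal sketch of the bytes a standard proto3 serializer would be expected to emit for the Example message. The class and method names (BoolValueWireSketch, encodeBoolValue, encodeExample) are hypothetical, and the encoding is written by hand rather than taken from the report. It assumes the usual proto3 rule that a scalar holding its default value is omitted on the wire.

```java
final class BoolValueWireSketch {
    // Hand-encodes google.protobuf.BoolValue { bool value = 1; }.
    // Assumption: a proto3 serializer omits `value` when it is false.
    static byte[] encodeBoolValue(boolean value) {
        // 0x08 = tag for field 1, wire type 0 (varint); 0x01 = true
        return value ? new byte[] {0x08, 0x01} : new byte[0];
    }

    // Hand-encodes Example { google.protobuf.BoolValue is_active = 1; }
    // with the wrapper always set.
    static byte[] encodeExample(boolean isActive) {
        byte[] inner = encodeBoolValue(isActive);
        byte[] out = new byte[2 + inner.length];
        out[0] = 0x0A;                 // tag for field 1, wire type 2 (length-delimited)
        out[1] = (byte) inner.length;  // payload length; fits in one varint byte here
        System.arraycopy(inner, 0, out, 2, inner.length);
        return out;
    }

    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(hex(encodeExample(false))); // 0a 00
        System.out.println(hex(encodeExample(true)));  // 0a 02 08 01
    }
}
```

Under these assumptions the wrapper field is on the wire in both cases. Only the inner value is missing when it is false, so a reader has to fall back to the default to recover it.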
>
> *Root Cause*
> In _ProtobufDeserializer.scala_, the logic for deserializing
> google.protobuf.BoolValue relies on the getFieldValue() method, which uses
> the following condition:
>
> {code:java}
> if (field.isRepeated || record.hasField(field) || field.hasDefaultValue ||
>   (!field.hasPresence && this.emitDefaultValues)) {
>   record.getField(field)
> } else {
>   null
> } {code}
>
> However, for BoolValue, record.hasField(field) returns false when the inner
> value is false, even though the field was explicitly set to false rather
> than left unset. As a result, _getFieldValue()_ returns null instead of
> false.
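The effect of the quoted condition can be checked in isolation with a small model. This is only a sketch: FieldValueSketch, FieldState and boolValueInner are hypothetical names, no Spark or protobuf classes are involved, and the flag values chosen for the inner BoolValue field (no explicit default, no presence tracking, hasField true only for a non-default value) are assumptions about what protobuf-java reports for a proto3 scalar.

```java
final class FieldValueSketch {
    /** Hypothetical stand-in for the facts the quoted condition reads about one field. */
    static final class FieldState {
        final boolean isRepeated;
        final boolean hasField;
        final boolean hasDefaultValue;
        final boolean hasPresence;
        final Object value;

        FieldState(boolean isRepeated, boolean hasField, boolean hasDefaultValue,
                   boolean hasPresence, Object value) {
            this.isRepeated = isRepeated;
            this.hasField = hasField;
            this.hasDefaultValue = hasDefaultValue;
            this.hasPresence = hasPresence;
            this.value = value;
        }
    }

    /** Same boolean structure as the condition quoted from getFieldValue(). */
    static Object getFieldValue(FieldState f, boolean emitDefaultValues) {
        if (f.isRepeated || f.hasField || f.hasDefaultValue
                || (!f.hasPresence && emitDefaultValues)) {
            return f.value;
        }
        return null;
    }

    /** Assumed state of BoolValue's inner field for a given payload. */
    static FieldState boolValueInner(boolean payload) {
        // Assumption: hasField is true only when the value differs from the default.
        return new FieldState(false, payload, false, false, Boolean.valueOf(payload));
    }

    public static void main(String[] args) {
        System.out.println(getFieldValue(boolValueInner(true), false));  // true
        System.out.println(getFieldValue(boolValueInner(false), false)); // null
        System.out.println(getFieldValue(boolValueInner(false), true));  // false
    }
}
```

In this model a false payload comes back as false only through the emitDefaultValues clause. With that flag off, every clause is false and the result is null, which matches the behavior described in the report.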