[ 
https://issues.apache.org/jira/browse/SPARK-53347?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gurpal SINGH updated SPARK-53347:
---------------------------------
    Labels: Correctness Deserialize boolean correctness data-loss false 
from_protobuf null protobuf spark  (was: Deserialize boolean false 
from_protobuf null protobuf spark)

> Spark from_protobuf() incorrectly deserializes "false" boolean values as null
> -----------------------------------------------------------------------------
>
>                 Key: SPARK-53347
>                 URL: https://issues.apache.org/jira/browse/SPARK-53347
>             Project: Spark
>          Issue Type: Bug
>          Components: Protobuf
>    Affects Versions: 3.5.0, 3.5.1, 3.5.2, 4.0.0
>         Environment: Scala 2.13,
> Spark 3.5.2
> JDK 17
> Maven 3.9.11
>            Reporter: Gurpal SINGH
>            Priority: Major
>              Labels: Correctness, Deserialize, boolean, correctness, 
> data-loss, false, from_protobuf, null, protobuf, spark
>   Original Estimate: 96h
>  Remaining Estimate: 96h
>
> *Problem*
> When a Protobuf message is deserialized with `from_protobuf()` in Spark, 
> boolean fields with the value `false` are incorrectly deserialized as `null`. 
> This puts incorrect data in the resulting DataFrame: an explicit `false` can 
> no longer be told apart from a missing value.
> *Reproduction*
> Given a Protobuf message like the following (using 
> `google.protobuf.BoolValue`):
>  
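> {code:java}
> syntax = "proto3";
> // BoolValue is defined in the well-known wrappers file.
> import "google/protobuf/wrappers.proto";
> message Example {
>   google.protobuf.BoolValue is_active = 1;
> }{code}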
>  
>  
> For a message in which is_active is explicitly set to false, the result of 
> _from_protobuf()_ shows null instead of false.
>  
> *Root Cause*
> In _ProtobufDeserializer.scala_, the logic for deserializing 
> google.protobuf.BoolValue relies on the getFieldValue() method, which uses 
> the following condition:
>  
> {code:java}
> if (field.isRepeated || record.hasField(field) || field.hasDefaultValue || 
> (!field.hasPresence && this.emitDefaultValues)) {
>   record.getField(field)
> } else {
>   null
> } {code}
>  
> However, for BoolValue, when the inner value is false, 
> record.hasField(field) returns false, even though the field was explicitly 
> set to false rather than left unset. None of the other conditions holds 
> unless emitDefaultValues is enabled, so _getFieldValue()_ returns null 
> instead of false.
>  
>  
>  
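The condition quoted in the issue can be modelled outside Spark to make the failure easier to follow. The sketch below is a plain-Python stand-in, not Spark or protobuf code: `FieldModel`, `RecordModel` and `get_field_value` are hypothetical names introduced for illustration. It assumes that a parsed record reports `has_field` as false for a proto3 scalar holding its default value, which is the behaviour the issue describes.

```python
from dataclasses import dataclass
from typing import Any


@dataclass(frozen=True)
class FieldModel:
    """Hypothetical stand-in for a protobuf field descriptor."""
    name: str
    default: Any
    is_repeated: bool = False
    has_default_value: bool = False  # explicit default declared in the schema
    has_presence: bool = False       # assumed False for a proto3 singular scalar


class RecordModel:
    """Hypothetical stand-in for a parsed message.

    Only values passed to the constructor count as "set". This mimics the
    assumption that a default-valued proto3 scalar is not reported as present.
    """

    def __init__(self, **values: Any) -> None:
        self._values = values

    def has_field(self, field: FieldModel) -> bool:
        return field.name in self._values

    def get_field(self, field: FieldModel) -> Any:
        # Fall back to the type default, as a protobuf getter would.
        return self._values.get(field.name, field.default)


def get_field_value(record: RecordModel, field: FieldModel,
                    emit_default_values: bool) -> Any:
    """Mirror of the condition quoted from the issue."""
    if (field.is_repeated
            or record.has_field(field)
            or field.has_default_value
            or (not field.has_presence and emit_default_values)):
        return record.get_field(field)
    return None


# The inner `value` field of BoolValue, modelled as a proto3 bool.
value_field = FieldModel(name="value", default=False)

# BoolValue carrying false: nothing is recorded as set in this model.
print(get_field_value(RecordModel(), value_field, emit_default_values=False))  # None
# Same record, with the emitDefaultValues branch of the condition enabled.
print(get_field_value(RecordModel(), value_field, emit_default_values=True))   # False
# BoolValue carrying true is reported as present and survives.
print(get_field_value(RecordModel(value=True), value_field, emit_default_values=False))  # True
```

In this model only the `emit_default_values` branch lets a `false` through, which matches the report that the value is lost when that flag is off. Whether enabling the corresponding flag is an acceptable workaround in Spark itself is not established by the issue text.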



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
