justaparth commented on code in PR #40686:
URL: https://github.com/apache/spark/pull/40686#discussion_r1181138667


##########
connector/protobuf/src/main/scala/org/apache/spark/sql/protobuf/ProtobufDeserializer.scala:
##########
@@ -299,6 +298,29 @@ private[sql] class ProtobufDeserializer(
     }
   }
 
+  private def getFieldValue(record: DynamicMessage, field: FieldDescriptor): 
AnyRef = {
+    // We return a value if one of:
+    // - the field is repeated
+    // - the field is explicitly present in the serialized proto
+    // - the field is proto2 with a default
+    // - field presence is not available and materializeZeroValues is set
+    //
+    // Repeated fields have to be treated separately as they cannot have 
`hasField`
+    // called on them. And we only materialize zero values for fields without 
presence
+    // information because that flag controls the behavior for those fields in 
the ambiguous
+    // case of "unset" or "set to zero value". See the docs in 
[[ProtobufOptions]] for more
+    // details.
+    if (
+      field.isRepeated
+        || record.hasField(field)
+        || field.hasDefaultValue
+        || (!field.hasPresence && this.materializeZeroValues)) {

Review Comment:
   > Is this essentially materialize Protobuf V3 scalar fields that are not 
declared optional?
   
   yes, this is the basic effect of this PR. this follows the proto3 spec 
https://protobuf.dev/programming-guides/proto3/#default. Maybe i could have 
made this more clear earlier
   
   from the spec:
   ```
   When a message is parsed, if the encoded message does not contain a 
particular singular element, the corresponding field in the parsed object is 
set to the default value for that field.
   ```
   
   It's more precise to say that in cases where "field was never set" and 
"field was explicitly set to the zero value" are not distinguishable, (i.e. 
_field presence is not available_), the library deserializes both cases as  
`null` by default, or to the type-specific zero value with this flag. This 
phrasing correctly covers every type in proto2 and proto3, and is consistent 
with the spec about field presence.
   
   However, I understand your reservations that field presence is not the most 
accessible concept, and i'll update this documentation to say what you said 
above and then link to field presence documentation for people who want to read 
further.
   
   > What happens to a optional int in V2 proto? Will it behave like proto3 int 
proto2 optional int?
   
   no proto2 fields get default values in this implementation because theres no 
ambiguity.



##########
connector/protobuf/src/main/scala/org/apache/spark/sql/protobuf/ProtobufDeserializer.scala:
##########
@@ -299,6 +298,29 @@ private[sql] class ProtobufDeserializer(
     }
   }
 
+  private def getFieldValue(record: DynamicMessage, field: FieldDescriptor): 
AnyRef = {
+    // We return a value if one of:
+    // - the field is repeated
+    // - the field is explicitly present in the serialized proto
+    // - the field is proto2 with a default
+    // - field presence is not available and materializeZeroValues is set
+    //
+    // Repeated fields have to be treated separately as they cannot have 
`hasField`
+    // called on them. And we only materialize zero values for fields without 
presence
+    // information because that flag controls the behavior for those fields in 
the ambiguous
+    // case of "unset" or "set to zero value". See the docs in 
[[ProtobufOptions]] for more
+    // details.
+    if (
+      field.isRepeated
+        || record.hasField(field)
+        || field.hasDefaultValue
+        || (!field.hasPresence && this.materializeZeroValues)) {

Review Comment:
   > If this is the case, lets update the documentation clearly say this. 
Better not to say anything about 'presence information`. We shouldn't not 
require users to go hunting down Protobuf documentation to understand it.
   
   yeah, ill update it



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to