justaparth commented on code in PR #40686:
URL: https://github.com/apache/spark/pull/40686#discussion_r1161754527


##########
connector/protobuf/src/main/scala/org/apache/spark/sql/protobuf/ProtobufDeserializer.scala:
##########
@@ -288,7 +289,21 @@ private[sql] class ProtobufDeserializer(
       var skipRow = false
       while (i < validFieldIndexes.length && !skipRow) {
         val field = validFieldIndexes(i)
-        val value = if (field.isRepeated || field.hasDefaultValue || 
record.hasField(field)) {
+
+        // If `materializeZeroValues` is true, the written field will contain 
the
+        // default zero value for its type (e.g. 0 for int, "" for string, etc:
+        // https://protobuf.dev/programming-guides/proto3/#default).
+        // Otherwise, the field will be null unless the serialized proto 
explicitly
+        // contains the field, or it has an explicit default (
+        // proto2 only 
https://protobuf.dev/programming-guides/proto2/#optional)
+        // Note that in proto3, the serialized proto will not contain the field
+        // if the value was explicitly set to its zero value. See:
+        // 
https://protobuf.dev/programming-guides/field_presence/#presence-in-proto3-apis

Review Comment:
   @pang-wu thanks for the callout, i'll double check that we're testing those 
cases and if not add a new one.
   
   In the meantime, i've refactored the code here slightly. I _think_ it might 
be simpler/correct to define the behavior through the lens of [field 
presence](https://protobuf.dev/programming-guides/field_presence/), i.e.
   - if a field has presence info, we will deserialize it if present and 
otherwise fill in `null`
   - if a field **doesn't** have presence info, we only deserialize it if it is 
explicitly set, or is proto2 with an explicit default.
   
   And then, on top of that logic (which mirrored the logic as-it-was), i've 
added the `materializeZeroValues` option which will materialize the default if 
a field does not have presence information available.
   
   Does this make more sense?
   
   I think now it's easier to see how the current code is treating fields 
without presence info, i.e. it's choosing to deserialize to null instead of 
choosing a default. If we think that's the right behavior going forward, maybe 
we could consider changing the default behavior of the code (though i'm afraid 
to break any existing applications that may use this). 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to