rangadi commented on code in PR #40686:
URL: https://github.com/apache/spark/pull/40686#discussion_r1181155546


##########
connector/protobuf/src/main/scala/org/apache/spark/sql/protobuf/utils/ProtobufOptions.scala:
##########
@@ -46,6 +46,41 @@ private[sql] class ProtobufOptions(
   // record has more depth than the allowed value for recursive fields, it 
will be truncated
   // and corresponding fields are ignored (dropped).
   val recursiveFieldMaxDepth: Int = 
parameters.getOrElse("recursive.fields.max.depth", "-1").toInt
+
+  // Whether to deserialize empty proto3 scalar fields as their default values.
+  // When a proto3 scalar field is empty, there is ambiguity as to whether the 
field
+  // was never set or if the field was explicitly set to zero. By default, the 
library

Review Comment:
   This feature does not solve the ambiguity, right? I still don't know what 
problem this is solving. What is an example usecase?



##########
connector/protobuf/src/main/scala/org/apache/spark/sql/protobuf/ProtobufDeserializer.scala:
##########
@@ -299,6 +298,36 @@ private[sql] class ProtobufDeserializer(
     }
   }
 
+  private def getFieldValue(record: DynamicMessage, field: FieldDescriptor): 
AnyRef = {
+    // We return a value if one of:
+    // - the field is repeated
+    // - the field is explicitly present in the serialized proto
+    // - the field is proto2 with a default
+    // - we should materialize the zero value
+    //
+    // Repeated fields have to be treated separately as they cannot have 
`hasField`
+    // called on them.
+    if (
+      field.isRepeated
+        || record.hasField(field) // checks for the field explicitly being set
+        || field.hasDefaultValue
+        || shouldGetZeroValue(record, field)) {
+      record.getField(field)
+    } else {
+      null
+    }
+  }
+
+  // See the docs for materialize.default.values in [[ProtobufOptions]].
+  // Decides whether we should get the zero values. We do this when
+  // the field is not present in the proto, and theres ambiguity about
+  // whether it was never set or if it was set to its zero value (i.e.
+  // the field lacks field presence information). This is basically
+  // proto3 scalar values only.
+  private def shouldGetZeroValue(record: DynamicMessage, field: 
FieldDescriptor): Boolean = {
+    !record.hasField(field) && !field.hasPresence && this.materializeZeroValues

Review Comment:
   lets move this to `getFieldValue()`.  We can see `!record.hasField(field)` 
is not needed. 



##########
connector/protobuf/src/main/scala/org/apache/spark/sql/protobuf/utils/ProtobufOptions.scala:
##########
@@ -46,6 +46,41 @@ private[sql] class ProtobufOptions(
   // record has more depth than the allowed value for recursive fields, it 
will be truncated
   // and corresponding fields are ignored (dropped).
   val recursiveFieldMaxDepth: Int = 
parameters.getOrElse("recursive.fields.max.depth", "-1").toInt
+
+  // Whether to deserialize empty proto3 scalar fields as their default values.
+  // When a proto3 scalar field is empty, there is ambiguity as to whether the 
field
+  // was never set or if the field was explicitly set to zero. By default, the 
library
+  // deserializes this situation as `null`, but with this flag will 
deserialize them
+  // as the type-specific default value.
+  //
+  // Note that this won't affect fields with the optional keyword in proto3, or
+  // any fields in proto2, as they don't have the ambiguity described above 
because
+  // they have presence information.
+  //
+  // As an example:
+  // ```
+  // syntax = "proto3";
+  // message Person {
+  //   string name = 1;
+  //   int64 age = 2;
+  //   optional string middle_name = 3;
+  //   optional int64 salary = 4;
+  // }
+  // ```
+  //
+  // And we have the serialized representation of the following proto:
+  // `Person(age=20, middle_name="smith")

Review Comment:
   You meant 'age=0'? 
   Use default value for 'middle_name' as well. That is the only case we are 
talking about.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to