justaparth commented on code in PR #40686:
URL: https://github.com/apache/spark/pull/40686#discussion_r1181193391
##########
connector/protobuf/src/main/scala/org/apache/spark/sql/protobuf/ProtobufDeserializer.scala:
##########
@@ -299,6 +298,29 @@ private[sql] class ProtobufDeserializer(
}
}
+ private def getFieldValue(record: DynamicMessage, field: FieldDescriptor):
AnyRef = {
+ // We return a value if one of:
+ // - the field is repeated
+ // - the field is explicitly present in the serialized proto
+ // - the field is proto2 with a default
+ // - field presence is not available and materializeZeroValues is set
+ //
+ // Repeated fields have to be treated separately as they cannot have
`hasField`
+ // called on them. And we only materialize zero values for fields without
presence
+ // information because that flag controls the behavior for those fields in
the ambiguous
+ // case of "unset" or "set to zero value". See the docs in
[[ProtobufOptions]] for more
+ // details.
+ if (
+ field.isRepeated
+ || record.hasField(field)
+ || field.hasDefaultValue
+ || (!field.hasPresence && this.materializeZeroValues)) {
Review Comment:
> Remember you could not explain if Messages would be serialized or not in
earlier comment? Why, because it was not easy why they would be or wound not be
included. You tested it, but didn't have an explanation.
If you check the thread where you originally asked:
https://github.com/apache/spark/pull/40686#discussion_r1179208375
I stated that message that weren't set in the original proto would remain
null because message fields have presence information. I then linked a test
asserting this exact behavior.
I don't think theres any confusion here?
> Message is different because it can contain other messages and easily blow
up the struct.
Yeah, i agree that message is an important type. This is why having a test
to explicitly assert the deserialization, and a comment explaining it will
cements the behavior. But i'm also trying to put myself in the shoes of a
future reader of this code. If we add additional clauses to the conditional
that are redundant, we are likely to confuse future readers who will inevitably
ask "why is this code checking specifically for message types here", for which
there isn't a fundamental answer, just that we had a desire in this PR to make
sure message types were handled. But the tests / comments already should do
that.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]