justaparth commented on code in PR #40686:
URL: https://github.com/apache/spark/pull/40686#discussion_r1160316173
##########
connector/protobuf/src/main/scala/org/apache/spark/sql/protobuf/ProtobufDeserializer.scala:
##########
@@ -288,7 +289,21 @@ private[sql] class ProtobufDeserializer(
var skipRow = false
while (i < validFieldIndexes.length && !skipRow) {
val field = validFieldIndexes(i)
- val value = if (field.isRepeated || field.hasDefaultValue ||
record.hasField(field)) {
+
+ // If `materializeZeroValues` is true, the written field will contain
the
+ // default zero value for its type (e.g. 0 for int, "" for string, etc:
+ // https://protobuf.dev/programming-guides/proto3/#default).
+ // Otherwise, the field will be null unless the serialized proto
explicitly
+ // contains the field, or it has an explicit default (
+ // proto2 only
https://protobuf.dev/programming-guides/proto2/#optional)
+ // Note that in proto3, the serialized proto will not contain the field
+ // if the value was explicitly set to its zero value. See:
+ //
https://protobuf.dev/programming-guides/field_presence/#presence-in-proto3-apis
Review Comment:
> Hmm, is it a possible bug? If a field is explicitly set to zero value, the
serialized proto message won't contain it, so when deserializing the message we
will get null instead of zero?
> a field set with zero value will not be present in serialized message, so
when deserializing it, it will be given a null instead of zero value originally?
@viirya i'm not sure; in some sense it could be considered a bug and in
another sense it could be considered a convention. I think the behavior of
filling in defaults makes sense because of the proto3 spec Pang linked above.
But because there is no way to distinguish between "field is set to zero
value" and "field is not present" in proto3 serialized messages,
deserialization libraries much choose what to do, either
a) if a field is not present in serialized value, deserialize as null [the
default behavior in spark-protobuf]
- in case the field was not set originally, this is "correct"
- in case the field was set to the zero value this is "incorrect"
b) if a field is not present, deserialize to its zero value [what i tried to
add in this PR]
- in case the field was not set originally, this is "incorrect"
- in case the field was set to the zero value, this is "correct"
So either way the library chooses, one of the cases is kind of strange (i.e.
you set a value and it doesn't come out, or you didn't set a value and yet a
default comes out). Given this, it does feel like it's not clear which is the
right "default". It also depends on whether the original message is proto2 or
proto3, i think. Maybe what we would actually want is:
- for proto3 serialized messages, zero-value materialization is the default
behavior since [field
presence](https://protobuf.dev/programming-guides/field_presence/#presence-in-proto3-apis)
is not available
- for proto2 serailazed messages, check for field presence since that info
is
[available](https://protobuf.dev/programming-guides/field_presence/#presence-in-proto2-apis)
?
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]