justaparth commented on code in PR #41498:
URL: https://github.com/apache/spark/pull/41498#discussion_r1222248127
##########
connector/protobuf/src/main/scala/org/apache/spark/sql/protobuf/ProtobufDeserializer.scala:
##########
@@ -247,12 +247,86 @@ private[sql] class ProtobufDeserializer(
updater.setLong(ordinal, micros +
TimeUnit.NANOSECONDS.toMicros(nanoSeconds))
case (MESSAGE, StringType)
- if protoType.getMessageType.getFullName == "google.protobuf.Any" =>
+ if protoType.getMessageType.getFullName == "google.protobuf.Any" =>
(updater, ordinal, value) =>
// Convert 'Any' protobuf message to JSON string.
val jsonStr = jsonPrinter.print(value.asInstanceOf[DynamicMessage])
updater.set(ordinal, UTF8String.fromString(jsonStr))
+ // Handle well known wrapper types. We unpack the value field instead of
keeping
Review Comment:
> Protobuf documentation says this is mainly for use with Any fields, and
for the other use case, we can use optional int32.
It's not really relevant what users of this library _could_ have done
instead of using these well known types; but as additional context, optional
didn't exist until recent protobuf versions anyways, so thats why this type
existed before that.
> Better for Spark to preserve the same information, right?
> The original purpose in Protobuf is to distinguish unset vs default value
(as you mentioned). Spark handling preserves that.
The information about presence is preserved even if we deserialize into
primitive type instead of as a struct.
> Just because of JSON serializer does this does not imply Spark should do.
I agree in principle, but in practice these json libraries represent
battle-tested and agreed upon ways to deserialize protobufs. (in comparison,
this spark protobuf parsing is relatively quite new). Spark struct and json
aren't really that different, both are a data representation of a protocol
buffer. Given that, I feel like it makes sense to use the json deserialization
as a useful guide when there are decisions to be made. And it clearly seems
like these wrappers are intended to function like primitives, not like structs.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]