justaparth commented on code in PR #41498:
URL: https://github.com/apache/spark/pull/41498#discussion_r1223192039
##########
connector/protobuf/src/main/scala/org/apache/spark/sql/protobuf/ProtobufDeserializer.scala:
##########
@@ -247,12 +247,86 @@ private[sql] class ProtobufDeserializer(
updater.setLong(ordinal, micros +
TimeUnit.NANOSECONDS.toMicros(nanoSeconds))
case (MESSAGE, StringType)
- if protoType.getMessageType.getFullName == "google.protobuf.Any" =>
+ if protoType.getMessageType.getFullName == "google.protobuf.Any" =>
(updater, ordinal, value) =>
// Convert 'Any' protobuf message to JSON string.
val jsonStr = jsonPrinter.print(value.asInstanceOf[DynamicMessage])
updater.set(ordinal, UTF8String.fromString(jsonStr))
+ // Handle well known wrapper types. We unpack the value field instead of keeping
Review Comment:
> Can we have a concrete example where this makes a difference? What problem are we solving?
> Can we have a fully spelled out example in Spark that shows the benefits?

Did you check the PR description? There's an example of what the deserialization would look like before and after. Beyond that, I'm not sure what kind of example would help... Could you be specific about what you're looking for here?
> Wrapper types are used because the wrapper is important, otherwise no need to use it. I don't see how stripping the wrapper is the right thing.
> Can we have a fully spelled out example in Spark that shows the benefits?

Let me give you a concrete example from the existing code. There are well-known Timestamp and Duration types, and the spark-protobuf library already handles them in a custom way, converting them to TimestampType / DayTimeIntervalType in Spark:
proto message in tests:
https://github.com/apache/spark/blob/master/connector/protobuf/src/test/resources/protobuf/functions_suite.proto#L167-L175
schema conversion logic:
https://github.com/apache/spark/blob/master/connector/protobuf/src/main/scala/org/apache/spark/sql/protobuf/utils/SchemaConverters.scala#L79-L90
deserialization logic:
https://github.com/apache/spark/blob/master/connector/protobuf/src/main/scala/org/apache/spark/sql/protobuf/ProtobufDeserializer.scala#L229-L247
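To spell out what that special handling amounts to, here is a minimal, self-contained sketch (my paraphrase of the idea behind the linked snippets, not the library code itself): the seconds/nanos fields of the well-known Timestamp and Duration messages are collapsed into the microsecond values that back Spark's TimestampType and DayTimeIntervalType.

```scala
import java.util.concurrent.TimeUnit

import com.google.protobuf.{Duration => ProtoDuration, Timestamp => ProtoTimestamp}

object WellKnownTypeSketch {
  // google.protobuf.Timestamp -> microseconds since the epoch, the physical
  // representation behind Spark's TimestampType.
  def timestampToMicros(ts: ProtoTimestamp): Long =
    TimeUnit.SECONDS.toMicros(ts.getSeconds) + TimeUnit.NANOSECONDS.toMicros(ts.getNanos)

  // google.protobuf.Duration -> microseconds, the physical representation
  // behind Spark's DayTimeIntervalType.
  def durationToMicros(d: ProtoDuration): Long =
    TimeUnit.SECONDS.toMicros(d.getSeconds) + TimeUnit.NANOSECONDS.toMicros(d.getNanos)

  def main(args: Array[String]): Unit = {
    val ts = ProtoTimestamp.newBuilder().setSeconds(1686000000L).setNanos(500000).build()
    println(timestampToMicros(ts)) // 1686000000000500
  }
}
```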
By your argument above, we should NOT be doing this; we should treat them as opaque messages and unpack them into a struct.
However, the library is choosing to make a decision based on the semantic meaning of those well-known types. They're not just opaque messages but messages with a well-known meaning, and the library is effectively saying "I know what these are meant to be and I will deserialize them differently."
In the same way, the remaining wrapper types are not simply opaque messages; they are intended to feel like primitives, and I think we should handle the rest of the well-known types just like we're handling Timestamp / Duration.
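For a wrapper type specifically, the difference is easy to see with a hypothetical field `google.protobuf.Int64Value count = 1;` (the field name here is made up for illustration). Treated as an opaque message, the Catalyst schema is `count: struct<value: bigint>` and callers have to reach into `count.value`; unwrapped, it is simply `count: bigint`, with an unset wrapper mapping naturally to a null Spark value. A minimal sketch of the unwrapping step:

```scala
import com.google.protobuf.{DynamicMessage, Int64Value}

object WrapperUnwrapSketch {
  // Pull the single "value" field out of a wrapper message such as
  // google.protobuf.Int64Value, yielding the primitive directly instead of a
  // one-field struct.
  def unwrapInt64(msg: DynamicMessage): Long = {
    val valueField = msg.getDescriptorForType.findFieldByName("value")
    msg.getField(valueField).asInstanceOf[Long]
  }

  def main(args: Array[String]): Unit = {
    val wrapped = Int64Value.newBuilder().setValue(42L).build()
    val dyn = DynamicMessage.parseFrom(Int64Value.getDescriptor, wrapped.toByteArray)
    println(unwrapInt64(dyn)) // 42
  }
}
```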
_(As an aside: the existing check for Timestamp and Duration seems overly broad; it will catch any message type that just happens to be named Timestamp or Duration, which is likely not what we want. The tests also aren't referencing the google protobuf classes, but an exact copy of them committed to the repo. I can send a separate PR for consideration of how to fix that.)_
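(Purely to illustrate that aside, and not the actual follow-up PR: a narrower check would compare the fully qualified name, the way the google.protobuf.Any branch in the diff above already does.)

```scala
import com.google.protobuf.Descriptors.Descriptor

object WellKnownNameCheck {
  // Stricter match: only the real well-known type qualifies, not an unrelated
  // user-defined message that merely happens to be named "Timestamp".
  def isWellKnownTimestamp(desc: Descriptor): Boolean =
    desc.getFullName == "google.protobuf.Timestamp"
}
```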
> These are just utilities, not a Protobuf spec.

Protobuf actually defines how the JSON output should look, so that is part of the spec. But I agree they never said the words "a Spark struct should look like XXX". I just think a Spark struct is much closer to JSON (it's a data format, not code) than to Java classes or something, so JSON provides a good guide. And, as I said above, the existing code is already handling some well-known types in a special way.
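As a quick illustration of that part of the spec (just a sketch using the stock protobuf-java-util printer), the proto3 JSON mapping serializes wrapper messages as the bare wrapped value rather than as an object with a `value` key:

```scala
import com.google.protobuf.{Int32Value, StringValue}
import com.google.protobuf.util.JsonFormat

object WrapperJsonSketch {
  def main(args: Array[String]): Unit = {
    val printer = JsonFormat.printer()
    // Per the proto3 JSON mapping, wrapper messages print as the bare
    // wrapped value, not as an object with a "value" key.
    println(printer.print(Int32Value.newBuilder().setValue(5).build()))        // 5
    println(printer.print(StringValue.newBuilder().setValue("hello").build())) // "hello"
  }
}
```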