justaparth commented on code in PR #41498:
URL: https://github.com/apache/spark/pull/41498#discussion_r1223192039
##########
connector/protobuf/src/main/scala/org/apache/spark/sql/protobuf/ProtobufDeserializer.scala:
##########
@@ -247,12 +247,86 @@ private[sql] class ProtobufDeserializer(
updater.setLong(ordinal, micros +
TimeUnit.NANOSECONDS.toMicros(nanoSeconds))
case (MESSAGE, StringType)
- if protoType.getMessageType.getFullName == "google.protobuf.Any" =>
+ if protoType.getMessageType.getFullName == "google.protobuf.Any" =>
(updater, ordinal, value) =>
// Convert 'Any' protobuf message to JSON string.
val jsonStr = jsonPrinter.print(value.asInstanceOf[DynamicMessage])
updater.set(ordinal, UTF8String.fromString(jsonStr))
+ // Handle well known wrapper types. We unpack the value field instead of keeping
Review Comment:
> Can we have a concrete example where this makes a difference? What problem are we solving?
> Can we have a fully spelled out example in Spark that shows the benefits?

Did you check the PR description? There's an example of what the deserialization would look like before and after. Beyond that, I'm not sure what kind of example would help... Could you be specific about what you're looking for here?
> Wrapper types are used because the wrapper is important, otherwise no need to use it. I don't see how stripping the wrapper is the right thing.
> Can we have a fully spelled out example in Spark that shows the benefits?

Let me give you a concrete example from the existing code. There are well-known Timestamp and Duration types, and the spark-protobuf library already handles them in a custom way, converting them to TimestampType / DayTimeIntervalType in Spark:
proto message in tests:
https://github.com/apache/spark/blob/master/connector/protobuf/src/test/resources/protobuf/functions_suite.proto#L167-L175
schema conversion logic:
https://github.com/apache/spark/blob/master/connector/protobuf/src/main/scala/org/apache/spark/sql/protobuf/utils/SchemaConverters.scala#L79-L90
deserialization logic:
https://github.com/apache/spark/blob/master/connector/protobuf/src/main/scala/org/apache/spark/sql/protobuf/ProtobufDeserializer.scala#L229-L247
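To spell out what that special handling amounts to, here is a minimal, self-contained sketch (my paraphrase of the idea behind the linked snippets, not the library code itself): the seconds/nanos fields of the well-known Timestamp and Duration messages are collapsed into the microsecond values that back Spark's TimestampType and DayTimeIntervalType.

```scala
import java.util.concurrent.TimeUnit

import com.google.protobuf.{Duration => ProtoDuration, Timestamp => ProtoTimestamp}

object WellKnownTypeSketch {
  // google.protobuf.Timestamp -> microseconds since the epoch, the physical
  // representation behind Spark's TimestampType.
  def timestampToMicros(ts: ProtoTimestamp): Long =
    TimeUnit.SECONDS.toMicros(ts.getSeconds) + TimeUnit.NANOSECONDS.toMicros(ts.getNanos)

  // google.protobuf.Duration -> microseconds, the physical representation
  // behind Spark's DayTimeIntervalType.
  def durationToMicros(d: ProtoDuration): Long =
    TimeUnit.SECONDS.toMicros(d.getSeconds) + TimeUnit.NANOSECONDS.toMicros(d.getNanos)

  def main(args: Array[String]): Unit = {
    val ts = ProtoTimestamp.newBuilder().setSeconds(1686000000L).setNanos(500000).build()
    println(timestampToMicros(ts)) // 1686000000000500
  }
}
```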
By your argument above, we should NOT be doing this; we should treat them as opaque messages and unpack them into a struct.
However, the library is choosing to make a decision based on the semantic meaning of those well-known types. They're not just opaque messages but messages with a well-known meaning, and the library is effectively saying "I know what these are meant to be and I will deserialize them differently."
In the same way, the remaining wrapper types are not simply opaque messages; they are intended to feel like primitives, and I think we should handle the rest of the well-known types just like we're handling Timestamp / Duration.
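For a wrapper type specifically, the difference is easy to see with a hypothetical field `google.protobuf.Int64Value count = 1;` (the field name here is made up for illustration). Treated as an opaque message, the Catalyst schema is `count: struct<value: bigint>` and callers have to reach into `count.value`; unwrapped, it is simply `count: bigint`, with an unset wrapper mapping naturally to a null Spark value. A minimal sketch of the unwrapping step:

```scala
import com.google.protobuf.{DynamicMessage, Int64Value}

object WrapperUnwrapSketch {
  // Pull the single "value" field out of a wrapper message such as
  // google.protobuf.Int64Value, yielding the primitive directly instead of a
  // one-field struct.
  def unwrapInt64(msg: DynamicMessage): Long = {
    val valueField = msg.getDescriptorForType.findFieldByName("value")
    msg.getField(valueField).asInstanceOf[Long]
  }

  def main(args: Array[String]): Unit = {
    val wrapped = Int64Value.newBuilder().setValue(42L).build()
    val dyn = DynamicMessage.parseFrom(Int64Value.getDescriptor, wrapped.toByteArray)
    println(unwrapInt64(dyn)) // 42
  }
}
```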
_(As an aside: the existing check for Timestamp and Duration seems overly broad; it will catch any message type that just happens to be named Timestamp or Duration, which is likely not what we want. The tests also aren't referencing the google protobuf classes, but an exact copy of them committed to the repo. I can send a separate PR for consideration of how to fix that.)_
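(Purely to illustrate that aside, and not the actual follow-up PR: a narrower check would compare the fully qualified name, the way the google.protobuf.Any branch in the diff above already does.)

```scala
import com.google.protobuf.Descriptors.Descriptor

object WellKnownNameCheck {
  // Stricter match: only the real well-known type qualifies, not an unrelated
  // user-defined message that merely happens to be named "Timestamp".
  def isWellKnownTimestamp(desc: Descriptor): Boolean =
    desc.getFullName == "google.protobuf.Timestamp"
}
```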
> These are just utilities, not a Protobuf spec.

Protobuf actually defines how the JSON output should look, so that is part of the spec. But I agree they never said the words "a Spark struct should look like XXX". I just think a Spark struct is much closer to JSON (it's a data format, not code) than to Java classes or something, so JSON provides a good guide. And, as I said above, the existing code is already handling some well-known types in a special way.
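As a quick illustration of that part of the spec (just a sketch using the stock protobuf-java-util printer), the proto3 JSON mapping serializes wrapper messages as the bare wrapped value rather than as an object with a `value` key:

```scala
import com.google.protobuf.{Int32Value, StringValue}
import com.google.protobuf.util.JsonFormat

object WrapperJsonSketch {
  def main(args: Array[String]): Unit = {
    val printer = JsonFormat.printer()
    // Per the proto3 JSON mapping, wrapper messages print as the bare
    // wrapped value, not as an object with a "value" key.
    println(printer.print(Int32Value.newBuilder().setValue(5).build()))        // 5
    println(printer.print(StringValue.newBuilder().setValue("hello").build())) // "hello"
  }
}
```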