bdrillard edited a comment on issue #24299: [SPARK-27388][SQL] expression encoder for objects defined by properties
URL: https://github.com/apache/spark/pull/24299#issuecomment-481773895
 
 
   Many thanks to @mazeboard for bringing this PR to my attention and for 
taking a crack at the problem of typed Avro Datasets in Spark!
   
   I feel it's a matter of due diligence for me to point to another PR supporting Avro-typed Datasets in Spark, namely #22878, which (full transparency) is the work of @xuanyuanking and myself. The approaches taken here and there are different, and so, it would seem, is their coverage of the Avro spec. I'd like to take the time to compare and contrast.
   
   I am more qualified to speak to the approach and capabilities introduced in #22878 (which has a history going back to [Spark-Avro](https://github.com/databricks/spark-avro/pull/217)), so if I misread this PR in that process, @mazeboard, please do correct my understanding.
   
   I apologize for the length of this reply ahead of time!
   
   ## #24299
   
   I'll summarize my reading of this PR's approach: to extend the existing Java 
`Bean` Encoder functionality to more broadly support tricky types generated by 
Avro `SpecificRecord` classes, especially Bytes, which Avro doesn't allow 
access to via typical getters/setters, and Enums, which have to be encoded as 
Strings and therefore have to be “discovered” via their class.
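   
   To make the entry point concrete, here's a minimal sketch of the bean-Encoder path this PR builds on, using only the stock `Encoders.bean` API; `User` is a hand-written stand-in for an avro-tools-generated class, not actual generated code:
   
   ```scala
   import org.apache.spark.sql.{Encoders, SparkSession}
   
   // Stand-in for a class avro-tools would generate; the PR extends this
   // bean-style reflection so Avro-specific members (ByteBuffer-backed bytes
   // fields, enums encoded as Strings) are handled as well.
   class User {
     private var name: String = _
     private var age: java.lang.Integer = _
     def getName: String = name
     def setName(n: String): Unit = { name = n }
     def getAge: java.lang.Integer = age
     def setAge(a: java.lang.Integer): Unit = { age = a }
   }
   
   object BeanEncoderSketch extends App {
     val spark = SparkSession.builder().master("local[*]").getOrCreate()
   
     // Encoders.bean is the reflection-driven entry point being extended.
     val userEncoder = Encoders.bean(classOf[User])
   
     val user = new User
     user.setName("alice")
     user.setAge(42)
     spark.createDataset(Seq(user))(userEncoder).show()
   }
   ```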
   
   One stated limitation is complex Union types, which Avro will represent as a nested `Object`. It's stated that there isn't an Encoder for the Object type (I think a serializer/deserializer could _perhaps_ be made using [ObjectType](https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/types/ObjectType.scala) and [NewInstance](https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/objects/objects.scala#L431), but I can't say how it would work in this Reflection-driven approach; a rough sketch follows). It's also said things could get tough with how Avro serializes those objects when the likes of Kryo are used. I can see that as a limitation of this Bean-based approach even if the DataTypes and Expressions were sorted out.
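   
   For what it's worth, here's a rough, speculative sketch of those two Catalyst pieces; it only constructs the expression tree (no evaluation), and `GeneratedRecord` is a hypothetical stand-in for a generated class with a union-typed member:
   
   ```scala
   import org.apache.spark.sql.catalyst.expressions.Literal
   import org.apache.spark.sql.catalyst.expressions.objects.NewInstance
   import org.apache.spark.sql.types.ObjectType
   
   // Hypothetical stand-in: a generated class whose union field surfaces as Object.
   class GeneratedRecord(val unionField: AnyRef)
   
   object ObjectTypeSketch extends App {
     // ObjectType can describe a JVM-object-valued column ...
     val unionMember = ObjectType(classOf[java.lang.Object])
   
     // ... and NewInstance yields an expression that constructs the target
     // class from child expressions at deserialization time. How a purely
     // Reflection-driven walk would pick the right union branch is the open
     // question.
     val construct = NewInstance(
       classOf[GeneratedRecord],
       Seq(Literal("someBranchValue")),
       ObjectType(classOf[GeneratedRecord]))
   
     println(construct.dataType)  // ObjectType(class GeneratedRecord)
   }
   ```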
   
   Correct my understanding if this is wrong, but because this approach is based on Reflection over the types of the generated `SpecificRecord` class (viewing the Avro object as a Bean), it would not be, in a first-class sense, "aware" of the AvroSchema that generates the Class. I think this distinction may matter, and I'll discuss it more below.
   
   ## #22878
   
   To summarize the approach of #22878: create an AvroEncoder which generates a 
pair of serializer/deserializer Expressions based on the AvroSchema.
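   
   To make "based on the AvroSchema" concrete, here's a minimal sketch using stock Avro APIs; the commented-out encoder factory paraphrases #22878's entry point from memory, so check the PR for the actual signature:
   
   ```scala
   import org.apache.avro.{Schema, SchemaBuilder}
   import org.apache.avro.generic.{GenericData, GenericRecord}
   
   object SchemaDrivenSketch extends App {
     // The AvroSchema, not the generated class, is the source of truth the
     // encoder traverses to emit its serializer/deserializer Expressions.
     val schema: Schema = SchemaBuilder.record("User").fields()
       .requiredString("name")
       .requiredInt("age")
       .endRecord()
   
     val rec: GenericRecord = new GenericData.Record(schema)
     rec.put("name", "alice")
     rec.put("age", 42)
   
     // Illustrative only -- my paraphrase of the PR's factory, not a
     // confirmed signature:
     //   implicit val enc = AvroEncoder.of[GenericRecord](schema)
     //   val ds = spark.createDataset(Seq(rec))
   
     println(schema.toString(true))  // pretty-printed JSON form of the schema
   }
   ```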
   
   This work stands on the Avro efforts that were recently folded into Spark proper from the (now deprecated) Databricks/Spark-Avro project, efforts which still _do not_ provide support for a first-class Encoder, nor the efficiency and strong/static typing one entails. Being based on Spark-Avro, however, #22878 _does_ gain the benefits of an AvroSchema-driven approach. To avoid confusion, I'm going to refer to this folded-in functionality as "Spark-Avro", meaning [this](https://github.com/apache/spark/tree/master/external/avro) portion of the current Spark project rather than the former deprecated project.
   
   ### AvroSchema
   
   Because #22878 generates its Encoder through the AvroSchema, we gain a couple of things:
   
   1. An isomorphism between the Dataset Schema and the source of truth for the structure of any Avro record, namely its AvroSchema. The important thing here is that we can generate an Encoder both from the Class and from the AvroSchema in its different representations, like JSON, which opens up support for Schema evolution and the use of Schema stores.
   2. Direct support for Encoders of any `GenericRecord`, where an AvroSchema is known but no `SpecificRecord` class can be generated (see the sketch after this list).
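   
   A short sketch of points 1 and 2 together, using stock Avro APIs: the schema arrives as JSON text (say, from a schema store), and the Encoder can be derived from the parsed Schema even though no `SpecificRecord` class exists. The commented factory is again my paraphrase, not a confirmed signature:
   
   ```scala
   import org.apache.avro.Schema
   
   object JsonSchemaSketch extends App {
     // Schema as it might arrive from a schema store/registry.
     val json =
       """{"type": "record", "name": "User", "fields": [
         |  {"name": "name", "type": "string"},
         |  {"name": "age",  "type": "int"}
         |]}""".stripMargin
   
     // Parsing is stock Avro; the AvroEncoder is then built from this Schema.
     val schema: Schema = new Schema.Parser().parse(json)
   
     // Illustrative only:
     //   implicit val enc = AvroEncoder.of[GenericRecord](schema)
   
     println(schema.getFields)
   }
   ```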
   
   ### Coverage
   
   Spark-Avro's ability to move between AvroSchema and Spark Dataset Schema also gives us the ability to traverse the DataType to create our Encoder's Ser/De Expressions, rather than using Reflection. This gives us a few immediate benefits (a sketch follows the list):
   
   1. The qualities of Bytes and Enums are represented transparently in the AvroSchema, so the SerDe expressions can be generated more directly.
   2. Nested Records and Complex Unions (as well as Avro Lists and Maps) are a solved problem in Spark-Avro, so we can generate a Dataset Schema of arbitrary complexity and nesting.
   3. Naturally, we don't have to extend the Bean Encoder for properties.
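   
   The traversal in question is observable through the `SchemaConverters` helper Spark-Avro carries (the same conversion the deprecated Databricks project exposed); a small sketch with a nested record and a nullable union field:
   
   ```scala
   import org.apache.avro.{Schema, SchemaBuilder}
   import org.apache.spark.sql.avro.SchemaConverters
   
   object ConverterSketch extends App {
     val address: Schema = SchemaBuilder.record("Address").fields()
       .requiredString("city")
       .endRecord()
   
     val schema: Schema = SchemaBuilder.record("User").fields()
       .requiredString("name")
       .name("address").`type`(address).noDefault()  // nested record
       .optionalString("nickname")                   // union ["null","string"]
       .endRecord()
   
     // Spark-Avro walks the AvroSchema into a Spark DataType; #22878 generates
     // its Ser/De Expressions over this same traversal rather than Reflection.
     val sqlType = SchemaConverters.toSqlType(schema)
     println(sqlType.dataType.prettyJson)
   }
   ```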
   
   Together, these mean the AvroEncoder in #22878 can generate an `Encoder` with full coverage of Avro types, and this coverage of the various combinations of types that can appear when nesting Records and Unions is well tested in the PR.
   
   ## Last thoughts
   
   This PR (#24299) goes a long way in support of Avro while still being very concise, which is definitely advantageous from a maintainability perspective. My concerns with a Reflection-based approach are:
   
   1. It's unknown whether, or how difficult it would be, to extend support to (especially) Union types, which are _very_ common in Avro and the _only_ way to express Nullability in Avro (see the example after this list).
   2. A Reflection-based approach also forgoes Datasets of `GenericRecord` objects.
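   
   For readers less steeped in Avro, a quick self-contained example (stock Avro API) of why Unions are unavoidable: nullability itself is spelled as a union with `"null"`:
   
   ```scala
   import org.apache.avro.Schema
   
   object NullableUnionExample extends App {
     // A nullable field is, by definition, a union with "null"; wider unions
     // (e.g. ["string", "int"]) are what generated classes surface as Object.
     val json =
       """{"type": "record", "name": "Event", "fields": [
         |  {"name": "id",      "type": "string"},
         |  {"name": "comment", "type": ["null", "string"], "default": null}
         |]}""".stripMargin
   
     val schema = new Schema.Parser().parse(json)
     println(schema.getField("comment").schema().getType)  // UNION
   }
   ```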
   
   Parting words for #22878: 
   
   1. While its length (in terms of lines of code) as a feature has been discussed, I'd say it's ultimately very well tested, considering we're testing a closed set of type combinations described by AvroSchema, with nested complex types tested via inductive unit tests. Anecdotally, I'll say we've been using it in a fork of Spark-Avro with great success over exceptionally complicated Schemas.
   2. It fits quite natively into the new Spark-Avro "external" sub-project, and accomplishes the goal: providing an Encoder for Datasets typed by arbitrary Avro (Generic and Specific).
   
   Again, I'm very happy to see where this discussion goes as it evolves (:
