bdrillard edited a comment on issue #24299: [SPARK-27388][SQL] expression 
encoder for objects defined by properties
URL: https://github.com/apache/spark/pull/24299#issuecomment-481773895
 
 
   Many thanks to @mazeboard for bringing this PR to my attention and for 
taking a crack at the problem of typed Avro Datasets in Spark!
   
   I feel it's a matter of due diligence for me to point to another PR supporting Avro-typed Datasets in Spark, namely #22878, which (full transparency) is the work of @xuanyuanking and myself. The approaches taken here and there are different, as, it would seem, is their coverage of the Avro spec, so I'd like to take the time to compare and contrast them.
   
   I am more qualified to speak to the approach and capabilities introduced in #22878 (which has a history going back to [Spark-Avro](https://github.com/databricks/spark-avro/pull/217)), so if I misread this PR in the process, @mazeboard, please do correct my understanding.
   
   ## #24299
   
   I'll summarize my reading of this PR's approach: to extend the existing Java 
`Bean` Encoder functionality to more broadly support tricky types generated by 
Avro `SpecificRecord` classes, especially Bytes, which Avro doesn't allow 
access to via typical getters/setters, and Enums, which have to be encoded as 
Strings and therefore have to be “discovered” via their class.
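
   To make the enum case concrete, here is a minimal, hypothetical sketch (not code from this PR) of how reflection can bridge an Avro enum property to and from its String form; the getter name and the enum `Class` are assumed to have already been discovered from the generated `SpecificRecord` class:

   ```
   import org.apache.avro.specific.SpecificRecord

   // Read an enum-typed property via its getter and encode it as its String symbol.
   def enumToString(record: SpecificRecord, getterName: String): String =
     record.getClass.getMethod(getterName).invoke(record).toString

   // Recover the enum constant from its String symbol, given the enum's Class.
   def stringToEnum[E <: Enum[E]](enumClass: Class[E], symbol: String): E =
     java.lang.Enum.valueOf(enumClass, symbol)
   ```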
   
   One stated limitation is complex Union types, which Avro will represent as a nested Object. It’s stated that there isn’t an Encoder for the Object type (I think a serializer/deserializer could _perhaps_ be made using [ObjectType](https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/types/ObjectType.scala) and [NewInstance](https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/objects/objects.scala#L431), but I can’t say how that would work in this Reflection-driven approach). It’s also said that things could get difficult with how Avro serializes those objects when serializers like Kryo are used, and I can see that as a limitation of this Bean-based approach even if the DataTypes and Expressions were sorted out.
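
   For what it's worth, a rough, hypothetical sketch of those two Catalyst pieces (not code from either PR; `MyNestedRecord` below is only a stand-in for an Avro-generated class with an all-args constructor):

   ```
   import org.apache.spark.sql.catalyst.expressions.Expression
   import org.apache.spark.sql.catalyst.expressions.objects.NewInstance
   import org.apache.spark.sql.types.ObjectType

   // Stand-in for a generated class; real SpecificRecord classes expose an all-args constructor.
   class MyNestedRecord(val id: java.lang.Long, val name: String)

   // Given deserializer Expressions already built for the record's fields,
   // construct the record itself as an ObjectType-typed Expression.
   def deserializerFor(fieldExprs: Seq[Expression]): Expression =
     NewInstance(
       classOf[MyNestedRecord],
       fieldExprs,
       ObjectType(classOf[MyNestedRecord]))
   ```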
   
   Correct my understanding if this is wrong, but because this approach is 
based on Reflection over the types of the generated `SpecificRecord` class 
(viewing the record as a Bean), it would not be (in a first-class sense) 
“aware” of the AvroSchema which generates the Class. I think this distinction 
may matter, and I’ll discuss it more below.
   
   ## #22878
   
   To summarize the approach of #22878: it creates an AvroEncoder which generates a pair of serializer/deserializer Expressions based on the AvroSchema (whether gleaned from a SpecificRecord class, or passed directly as a JSON string).
   
   Its work stands on the Avro efforts that were recently folded into Spark proper from the (now deprecated) Databricks/Spark-Avro project. However, to date, Spark-Avro _does not_ provide support for a first-class Encoder, along with the efficiency and strong/static typing that entails. Being based on Spark-Avro, however, #22878 _does_ gain the benefits of an AvroSchema-driven approach. To avoid confusion, I’m going to refer to this recently folded-in functionality as “Spark-Avro”, referencing [this](https://github.com/apache/spark/tree/master/external/avro) portion of the current Spark project, rather than the former deprecated project.
   
   ### AvroSchema
   
   Because #22878 generates its Encoder through the AvroSchema, we gain a couple of things:
   
   1. An isomorphism between the Dataset Schema and the source of truth for the structure of any Avro record, namely its AvroSchema. The important thing here is that we can generate an Encoder both from the Class and from the AvroSchema in its different representations, like JSON, which opens up support for Schema evolution and the use of Schema stores.
   2. Direct support for Encoders of any `GenericRecord`, where an AvroSchema is known but no `SpecificRecord` class can be generated.
   
   With #22878 you can create an `Encoder` in any of the following ways:
   
   ```
   val specificEncoder: Encoder[MyClass] = AvroEncoder.of(classOf[MyClass])
   // or, from the AvroSchema (a JSON string) of a SpecificRecord
   val specificEncoderFromSchema: Encoder[MyClass] = AvroEncoder.of(myClassAvroSchema)
   // or, from the AvroSchema (a JSON string) of a GenericRecord
   val genericEncoder: Encoder[GenericRecord] = AvroEncoder.of(genericRecordAvroSchema)
   ```
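
   Assuming a `SparkSession` named `spark` and a local `Seq` of `MyClass` records (both hypothetical here), the resulting `Encoder` can then be supplied explicitly when building the Dataset, e.g.:

   ```
   // a usage sketch; `spark`, `records`, and `specificEncoder` come from the context above
   val ds: Dataset[MyClass] = spark.createDataset(records)(specificEncoder)
   ```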
   
   ### Coverage
   
   Spark-Avro’s ability to move between AvroSchema and Spark Dataset Schema also gives us the ability to traverse the DataType to create our Encoder’s Ser/De Expressions rather than using Reflection. This gives us three immediate benefits:
   
   1. The qualities of Bytes and Enums are more transparently represented in 
the AvroSchema, and so the SerDe expressions can be more directly generated.
   2. Nested Records and Complex Unions (as well as Avro Lists and Maps) are a solved problem in Spark-Avro, so we can generate a Dataset Schema of arbitrary complexity and nesting (using the `ObjectType` DataType and `NewInstance` Expression I mentioned previously).
   3. Naturally, we don’t have to extend the Bean Encoder for properties.
   
   Taken together, these mean the AvroEncoder in #22878 can generate an `Encoder` having full coverage of Avro types, and this coverage of the various combinations of types that can appear when nesting Records and Unions is well tested in the PR. A minimal sketch of what this schema-driven traversal can look like follows.
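
   The sketch below is hypothetical (not the PR's code) and only maps a plain Avro `Schema` to a Spark `DataType`, glossing over logical types, nullability metadata, and the Expression generation itself; it is meant only to show the shape of a schema-driven traversal:

   ```
   import scala.collection.JavaConverters._

   import org.apache.avro.Schema
   import org.apache.avro.Schema.Type._
   import org.apache.spark.sql.types._

   // Walk an AvroSchema and produce the corresponding Spark DataType.
   def toSparkType(schema: Schema): DataType = schema.getType match {
     case BOOLEAN       => BooleanType
     case INT           => IntegerType
     case LONG          => LongType
     case FLOAT         => FloatType
     case DOUBLE        => DoubleType
     case BYTES | FIXED => BinaryType
     case STRING | ENUM => StringType  // enum symbols surface as Strings
     case NULL          => NullType
     case ARRAY         => ArrayType(toSparkType(schema.getElementType))
     case MAP           => MapType(StringType, toSparkType(schema.getValueType))  // Avro map keys are always Strings
     case RECORD =>
       StructType(schema.getFields.asScala.map { f =>
         StructField(f.name, toSparkType(f.schema))
       }.toSeq)
     case UNION =>
       // The common ["null", X] union becomes a nullable X; complex unions need more care.
       val nonNull = schema.getTypes.asScala.filter(_.getType != NULL)
       if (nonNull.size == 1) toSparkType(nonNull.head)
       else throw new UnsupportedOperationException(s"Complex union not handled in this sketch: $schema")
   }
   ```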
   
   ## Last thoughts
   
   The PR goes a long way in support of Avro while still being very concise, 
which is definitely advantageous from a maintainability perspective. My 
concerns with a Reflection-based approach are:
   
   1. It’s unclear whether, or how readily, support could be extended to (especially) Union types, which are _very_ common in Avro and the _only_ way to express Nullability in Avro (see the small example after this list).
   2. A Reflection-based approach also foregoes (at least on my reading) Datasets of `GenericRecord` objects.
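
   As a small illustration of that first point (a hypothetical schema, not taken from either PR), the only way to mark a field as optional in Avro is to union its type with `"null"`:

   ```
   import org.apache.avro.Schema

   // A record with one required field and one nullable field; nullability is a union.
   val json =
     """{
       |  "type": "record",
       |  "name": "User",
       |  "fields": [
       |    {"name": "id",   "type": "long"},
       |    {"name": "name", "type": ["null", "string"], "default": null}
       |  ]
       |}""".stripMargin

   val schema: Schema = new Schema.Parser().parse(json)
   // schema.getField("name").schema().getType == Schema.Type.UNION
   ```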
   
   Parting words for #22878: 
   
   1. While its length (in terms of lines of code) as a feature has been discussed, I’d say it’s ultimately [very well tested](https://github.com/apache/spark/pull/22878/files#diff-24ca7610c9c163779104e6c797713431R31), considering we’re testing a closed set of type combinations described by AvroSchema, and nested complex types are tested via inductive unit tests. Anecdotally, we’ve been using it in a fork of Spark-Avro with great success over exceptionally complicated Schemas.
   2. It fits quite natively in the new Spark-Avro “external” sub-project, and 
accomplishes the goal: providing an Encoder for Datasets typed by arbitrary 
Avro (Generic and Specific), with native support for Schema evolution.
   
   Again, I'm very happy to see where this discussion goes as it evolves (:
