bdrillard edited a comment on issue #24299: [SPARK-27388][SQL] expression encoder for objects defined by properties
URL: https://github.com/apache/spark/pull/24299#issuecomment-481773895
 
 
   Many thanks to @mazeboard for bringing this PR to my attention and for 
taking a crack at the problem of typed Avro Datasets in Spark!
   
   I feel it's a matter of due diligence for me to point to another PR supporting Avro-typed Datasets in Spark, namely #22878, which (full transparency) is the work of @xuanyuanking and myself. The approaches taken here and there are different, and so, it would seem, is their coverage of the Avro spec. I'd like to take the time to compare and contrast.
   
   I am more qualified to speak to the approach and capabilities introduced in #22878 (which has a history going back to [Spark-Avro](https://github.com/databricks/spark-avro/pull/217)), so if I misread this PR in that process, @mazeboard, please do correct my understanding.
   
   I apologize for the length of this reply ahead of time!
   
   ## #24299
   
   I'll summarize my reading of this PR's approach: to extend the existing Java 
`Bean` Encoder functionality to more broadly support tricky types generated by 
Avro `SpecificRecord` classes, especially Bytes, which Avro doesn't allow 
access to via typical getters/setters, and Enums, which have to be encoded as 
Strings and therefore have to be “discovered” via their class.
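   
   To make the entry point concrete, here's a minimal sketch of the bean-Encoder path this PR builds on, using only the stock `Encoders.bean` API; `User` is a hand-written stand-in for an avro-tools-generated class, not actual generated code:
   
   ```scala
   import org.apache.spark.sql.{Encoders, SparkSession}
   
   // Stand-in for a class avro-tools would generate; the PR extends this
   // bean-style reflection so Avro-specific members (ByteBuffer-backed bytes
   // fields, enums encoded as Strings) are handled as well.
   class User {
     private var name: String = _
     private var age: java.lang.Integer = _
     def getName: String = name
     def setName(n: String): Unit = { name = n }
     def getAge: java.lang.Integer = age
     def setAge(a: java.lang.Integer): Unit = { age = a }
   }
   
   object BeanEncoderSketch extends App {
     val spark = SparkSession.builder().master("local[*]").getOrCreate()
   
     // Encoders.bean is the reflection-driven entry point being extended.
     val userEncoder = Encoders.bean(classOf[User])
   
     val user = new User
     user.setName("alice")
     user.setAge(42)
     spark.createDataset(Seq(user))(userEncoder).show()
   }
   ```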
   
   One stated limitation is complex Union types, which Avro will represent as a nested `Object`. It's stated that there isn't an Encoder for the Object type (I think a serializer/deserializer could _perhaps_ be made using [ObjectType](https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/types/ObjectType.scala) and [NewInstance](https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/objects/objects.scala#L431), but I can't say how it would work in this Reflection-driven approach; a rough sketch follows). It's also said things could get tough with how Avro serializes those objects when the likes of Kryo are used. I can see that as a limitation of this Bean-based approach even if the DataTypes and Expressions were sorted out.
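   
   For what it's worth, here's a rough, speculative sketch of those two Catalyst pieces; it only constructs the expression tree (no evaluation), and `GeneratedRecord` is a hypothetical stand-in for a generated class with a union-typed member:
   
   ```scala
   import org.apache.spark.sql.catalyst.expressions.Literal
   import org.apache.spark.sql.catalyst.expressions.objects.NewInstance
   import org.apache.spark.sql.types.ObjectType
   
   // Hypothetical stand-in: a generated class whose union field surfaces as Object.
   class GeneratedRecord(val unionField: AnyRef)
   
   object ObjectTypeSketch extends App {
     // ObjectType can describe a JVM-object-valued column ...
     val unionMember = ObjectType(classOf[java.lang.Object])
   
     // ... and NewInstance yields an expression that constructs the target
     // class from child expressions at deserialization time. How a purely
     // Reflection-driven walk would pick the right union branch is the open
     // question.
     val construct = NewInstance(
       classOf[GeneratedRecord],
       Seq(Literal("someBranchValue")),
       ObjectType(classOf[GeneratedRecord]))
   
     println(construct.dataType)  // ObjectType(class GeneratedRecord)
   }
   ```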
   
   Correct my understanding if this is wrong, but because this approach is based on Reflection over the types of the generated `SpecificRecord` class (viewing the Avro object as a Bean), it would not be, in a first-class sense, "aware" of the AvroSchema that generates the Class. I think this distinction may matter, and I'll discuss it more below.
   
   ## #22878
   
   To summarize the approach of #22878: create an AvroEncoder which generates a 
pair of serializer/deserializer Expressions based on the AvroSchema.
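   
   To make "based on the AvroSchema" concrete, here's a minimal sketch using stock Avro APIs; the commented-out encoder factory paraphrases #22878's entry point from memory, so check the PR for the actual signature:
   
   ```scala
   import org.apache.avro.{Schema, SchemaBuilder}
   import org.apache.avro.generic.{GenericData, GenericRecord}
   
   object SchemaDrivenSketch extends App {
     // The AvroSchema, not the generated class, is the source of truth the
     // encoder traverses to emit its serializer/deserializer Expressions.
     val schema: Schema = SchemaBuilder.record("User").fields()
       .requiredString("name")
       .requiredInt("age")
       .endRecord()
   
     val rec: GenericRecord = new GenericData.Record(schema)
     rec.put("name", "alice")
     rec.put("age", 42)
   
     // Illustrative only -- my paraphrase of the PR's factory, not a
     // confirmed signature:
     //   implicit val enc = AvroEncoder.of[GenericRecord](schema)
     //   val ds = spark.createDataset(Seq(rec))
   
     println(schema.toString(true))  // pretty-printed JSON form of the schema
   }
   ```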
   
   This work stands on the Avro efforts that were recently folded into Spark proper from the (now deprecated) Databricks/Spark-Avro project, efforts which still _do not_ provide support for a first-class Encoder, nor the efficiency and strong/static typing one entails. Being based on Spark-Avro, however, #22878 _does_ gain the benefits of an AvroSchema-driven approach. To avoid confusion, I'm going to refer to this folded-in functionality as "Spark-Avro", meaning [this](https://github.com/apache/spark/tree/master/external/avro) portion of the current Spark project rather than the former deprecated project.
   
   ### AvroSchema
   
   Because #22878 generates its Encoder through the AvroSchema, we gain a couple of things:
   
   1. An isomorphism between the Dataset Schema and the source of truth for the structure of any Avro record, namely its AvroSchema. The important thing here is that we can generate an Encoder both from the Class and from the AvroSchema in its different representations, like JSON, which opens up support for Schema evolution and the use of Schema stores.
   2. Direct support for Encoders of any `GenericRecord`, where an AvroSchema is known but no `SpecificRecord` class can be generated (see the sketch after this list).
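   
   A short sketch of points 1 and 2 together, using stock Avro APIs: the schema arrives as JSON text (say, from a schema store), and the Encoder can be derived from the parsed Schema even though no `SpecificRecord` class exists. The commented factory is again my paraphrase, not a confirmed signature:
   
   ```scala
   import org.apache.avro.Schema
   
   object JsonSchemaSketch extends App {
     // Schema as it might arrive from a schema store/registry.
     val json =
       """{"type": "record", "name": "User", "fields": [
         |  {"name": "name", "type": "string"},
         |  {"name": "age",  "type": "int"}
         |]}""".stripMargin
   
     // Parsing is stock Avro; the AvroEncoder is then built from this Schema.
     val schema: Schema = new Schema.Parser().parse(json)
   
     // Illustrative only:
     //   implicit val enc = AvroEncoder.of[GenericRecord](schema)
   
     println(schema.getFields)
   }
   ```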
   
   ### Coverage
   
   Spark-Avro's ability to move between AvroSchema and Spark Dataset Schema also gives us the ability to traverse the DataType to create our Encoder's Ser/De Expressions, rather than using Reflection. This gives us a few immediate benefits (a sketch follows the list):
   
   1. The qualities of Bytes and Enums are represented transparently in the AvroSchema, so the SerDe expressions can be generated more directly.
   2. Nested Records and Complex Unions (as well as Avro Lists and Maps) are a solved problem in Spark-Avro, so we can generate a Dataset Schema of arbitrary complexity and nesting.
   3. Naturally, we don't have to extend the Bean Encoder for properties.
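   
   The traversal in question is observable through the `SchemaConverters` helper Spark-Avro carries (the same conversion the deprecated Databricks project exposed); a small sketch with a nested record and a nullable union field:
   
   ```scala
   import org.apache.avro.{Schema, SchemaBuilder}
   import org.apache.spark.sql.avro.SchemaConverters
   
   object ConverterSketch extends App {
     val address: Schema = SchemaBuilder.record("Address").fields()
       .requiredString("city")
       .endRecord()
   
     val schema: Schema = SchemaBuilder.record("User").fields()
       .requiredString("name")
       .name("address").`type`(address).noDefault()  // nested record
       .optionalString("nickname")                   // union ["null","string"]
       .endRecord()
   
     // Spark-Avro walks the AvroSchema into a Spark DataType; #22878 generates
     // its Ser/De Expressions over this same traversal rather than Reflection.
     val sqlType = SchemaConverters.toSqlType(schema)
     println(sqlType.dataType.prettyJson)
   }
   ```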
   
   Together, these mean the AvroEncoder in #22878 can generate an `Encoder` with full coverage of Avro types, and this coverage of the various combinations of types that can appear when nesting Records and Unions is well tested in the PR.
   
   ## Last thoughts
   
   This PR (#24299) goes a long way in support of Avro while still being very concise, which is definitely advantageous from a maintainability perspective. My concerns with a Reflection-based approach are:
   
   1. It's unknown whether, or how difficult it would be, to extend support to (especially) Union types, which are _very_ common in Avro and the _only_ way to express Nullability in Avro (see the example after this list).
   2. A Reflection-based approach also forgoes Datasets of `GenericRecord` objects.
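   
   For readers less steeped in Avro, a quick self-contained example (stock Avro API) of why Unions are unavoidable: nullability itself is spelled as a union with `"null"`:
   
   ```scala
   import org.apache.avro.Schema
   
   object NullableUnionExample extends App {
     // A nullable field is, by definition, a union with "null"; wider unions
     // (e.g. ["string", "int"]) are what generated classes surface as Object.
     val json =
       """{"type": "record", "name": "Event", "fields": [
         |  {"name": "id",      "type": "string"},
         |  {"name": "comment", "type": ["null", "string"], "default": null}
         |]}""".stripMargin
   
     val schema = new Schema.Parser().parse(json)
     println(schema.getField("comment").schema().getType)  // UNION
   }
   ```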
   
   Parting words for #22878: 
   
   1. While its length (in terms of lines of code) as a feature has been discussed, I'd say it's ultimately very well tested, considering we're testing a closed set of type combinations described by AvroSchema, with nested complex types tested via inductive unit tests. Anecdotally, I'll say we've been using it in a fork of Spark-Avro with great success over exceptionally complicated Schemas.
   2. It fits quite natively into the new Spark-Avro "external" sub-project, and accomplishes the goal: providing an Encoder for Datasets typed by arbitrary Avro (Generic and Specific).
   
   Again, I'm very happy to see where this discussion goes as it evolves (:
