bdrillard commented on issue #22878: [SPARK-25789][SQL] Support for Dataset of Avro
URL: https://github.com/apache/spark/pull/22878#issuecomment-473334726

@HyukjinKwon, to answer your question: when I mentioned access to "broader APIs", I meant

1. specifically the typed "[Java Function](https://spark.apache.org/docs/latest/api/java/index.html?org/apache/spark/api/java/function/package-summary.html)" APIs (although Scala can infer the same types, given the Encoder implementation being considered here), _and_
2. the ability for users to author their own APIs using Spark's functions, as part of their own applications.

To your concern about typing: I don't think adding support for Datasets typed by any Avro Specific/Generic record really _constrains_ or overburdens the rest of the Spark APIs. The aggregate functions that are now purely untyped will work the same for a Dataset of an arbitrary Specific/Generic record as they would for a Dataset of an arbitrary Scala Product.

To workarounds in RDDs: without a first-class `Encoder`, there are significant tradeoffs in user experience and in API authorship. Avro is distinctive in the **complexity** of its schemas, and that complexity makes authoring operations at the (untyped) Row level highly challenging. The rationale here is very similar to the rationale for having an `Encoder` for arbitrary Java Beans. Avro schemas can be _very_ deeply nested, with complex types, so authoring the body of a `MapFunction` against `Row` requires a lot of unsafe operations. Without the proposed `AvroEncoder`,

* authoring type-safe functions would require staying exclusively in RDDs, which is much less fluent than the Dataset API, or else performing unsafe type coercions between `Dataset<Row>` and an RDD of Avro types, or
* authoring functions against `Dataset<Row>` would be highly type-unsafe, requiring a lot of clunky `get` calls that make assumptions about the schema and data type of each column in the Row (see the sketch at the end of this comment).

The issue (and benefits) of committing an `AvroEncoder` to Spark proper has been discussed for some time. Originally, this functionality was going to be committed to the now-deprecated spark-avro library (see [spark-avro#217](https://github.com/databricks/spark-avro/pull/217)), but at that time, Spark committer @marmbrus advocated making it more first-class, in Spark proper.
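To make the Row-level friction concrete, here is a minimal sketch. The input path, the `user`/`name` field names, and the commented-out `AvroEncoder.of(...)` factory and `User` record class are all illustrative assumptions, not confirmed APIs or data; only the `Dataset[Row]` `get` calls are standard Spark.

```scala
import org.apache.spark.sql.{Dataset, Encoders, Row, SparkSession}

object AvroEncoderSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("avro-encoder-sketch")
      .getOrCreate()

    // Assume an Avro-derived DataFrame whose schema nests a `user` struct
    // containing a `name` string (field names are illustrative).
    val df: Dataset[Row] = spark.read.format("avro").load("/path/to/records.avro")

    // Untyped: every access is a `get` call that bakes in assumptions about
    // the schema; a typo or schema drift only fails at runtime.
    val namesUntyped: Dataset[String] = df.map { row =>
      val user = row.getStruct(row.fieldIndex("user"))
      user.getString(user.fieldIndex("name"))
    }(Encoders.STRING)

    // Typed: with the proposed AvroEncoder (factory method shown is
    // illustrative) for a hypothetical Avro-generated SpecificRecord class
    // `Event` with a nested `User`, the same traversal would be checked at
    // compile time:
    //
    //   implicit val eventEncoder: Encoder[Event] = AvroEncoder.of(classOf[Event])
    //   val namesTyped: Dataset[String] =
    //     df.as[Event].map(_.getUser.getName)(Encoders.STRING)

    namesUntyped.show()
    spark.stop()
  }
}
```

The contrast is the point: the untyped version compiles against any schema and defers every mistake to runtime, while the typed version pushes those errors to compile time, which is exactly what an `Encoder` buys for Beans and Products today.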
