bdrillard commented on issue #22878: [SPARK-25789][SQL] Support for Dataset of Avro
URL: https://github.com/apache/spark/pull/22878#issuecomment-473334726

@HyukjinKwon, to answer your question: when I mentioned access to "broader APIs", I meant

1. specifically the typed "[Java Function](https://spark.apache.org/docs/latest/api/java/index.html?org/apache/spark/api/java/function/package-summary.html)" APIs (although Scala can infer the same types, given the Encoder implementation being considered here), _and_
2. the ability for users to author their own APIs using Spark's functions, as part of their own applications.

To your concern about typing: I don't think adding support for Datasets typed by any Avro Specific/Generic record really _constrains_ or overburdens the rest of the Spark APIs. The aggregate functions that are now purely untyped will work the same for a Dataset of an arbitrary Specific/Generic record as they would for a Dataset of an arbitrary Scala Product.

To workarounds in RDDs: without a first-class `Encoder`, there are significant tradeoffs in user experience and in API authorship. Avro is distinctive in the **complexity** of its schemas, and that complexity makes authoring operations at the (untyped) Row level highly challenging. The rationale here is very similar to the rationale for having an `Encoder` for arbitrary Java Beans. Avro schemas can be _very_ deeply nested, with complex types, so authoring the body of a `MapFunction` against `Row` requires a lot of unsafe operations. Without the proposed `AvroEncoder`,

* authoring type-safe functions would require staying exclusively in RDDs, which is much less fluent than the Dataset API, or else performing unsafe type coercions between `Dataset<Row>` and an RDD of Avro types, or
* authoring functions against `Dataset<Row>` would be highly type-unsafe, requiring a lot of clunky `get` calls that make assumptions about the schema and data type of each column in the Row (see the sketch at the end of this comment).

The issue (and benefits) of committing an `AvroEncoder` to Spark proper has been discussed for some time. Originally, this functionality was going to be committed to the now-deprecated spark-avro library (see [spark-avro#217](https://github.com/databricks/spark-avro/pull/217)), but at that time, Spark committer @marmbrus advocated making it more first-class, in Spark proper.
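To make the Row-level friction concrete, here is a minimal sketch. The input path, the `user`/`name` field names, and the commented-out `AvroEncoder.of(...)` factory and `User` record class are all illustrative assumptions, not confirmed APIs or data; only the `Dataset[Row]` `get` calls are standard Spark.

```scala
import org.apache.spark.sql.{Dataset, Encoders, Row, SparkSession}

object AvroEncoderSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("avro-encoder-sketch")
      .getOrCreate()

    // Assume an Avro-derived DataFrame whose schema nests a `user` struct
    // containing a `name` string (field names are illustrative).
    val df: Dataset[Row] = spark.read.format("avro").load("/path/to/records.avro")

    // Untyped: every access is a `get` call that bakes in assumptions about
    // the schema; a typo or schema drift only fails at runtime.
    val namesUntyped: Dataset[String] = df.map { row =>
      val user = row.getStruct(row.fieldIndex("user"))
      user.getString(user.fieldIndex("name"))
    }(Encoders.STRING)

    // Typed: with the proposed AvroEncoder (factory method shown is
    // illustrative) for a hypothetical Avro-generated SpecificRecord class
    // `Event` with a nested `User`, the same traversal would be checked at
    // compile time:
    //
    //   implicit val eventEncoder: Encoder[Event] = AvroEncoder.of(classOf[Event])
    //   val namesTyped: Dataset[String] =
    //     df.as[Event].map(_.getUser.getName)(Encoders.STRING)

    namesUntyped.show()
    spark.stop()
  }
}
```

The contrast is the point: the untyped version compiles against any schema and defers every mistake to runtime, while the typed version pushes those errors to compile time, which is exactly what an `Encoder` buys for Beans and Products today.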
