[https://issues.apache.org/jira/browse/SPARK-9999?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15356644#comment-15356644]
Reynold Xin commented on SPARK-9999:
------------------------------------
After thinking about this more, I don't think it will happen any time soon. We
simply don't see a strong benefit to having a type-safe way to work with data
in Python. After all, Python itself has no compile-time type safety.
> Dataset API on top of Catalyst/DataFrame
> ----------------------------------------
>
> Key: SPARK-9999
> URL: https://issues.apache.org/jira/browse/SPARK-9999
> Project: Spark
> Issue Type: Story
> Components: SQL
> Reporter: Reynold Xin
> Assignee: Michael Armbrust
> Fix For: 2.0.0
>
>
> The RDD API is very flexible, and as a result its execution is harder to
> optimize in some cases. The DataFrame API, on the other hand, is much easier
> to optimize, but lacks some of the nice perks of the RDD API (e.g. it is
> harder to use UDFs, and there is a lack of strong types in Scala/Java).
> The goal of Spark Datasets is to provide an API that allows users to easily
> express transformations on domain objects, while also providing the
> performance and robustness advantages of the Spark SQL execution engine.
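> As a rough illustration, a minimal sketch of what such transformations on
> domain objects could look like (assuming the Spark 2.x {{SparkSession}}
> entry point; the {{Person}} class and the data here are hypothetical):
> {code:scala}
> import org.apache.spark.sql.SparkSession
>
> // Hypothetical domain object; case classes get built-in encoders.
> case class Person(name: String, age: Long)
>
> val spark = SparkSession.builder().master("local[*]").getOrCreate()
> import spark.implicits._  // encoders for case classes, tuples, primitives
>
> // Transformations are written against typed domain objects, but still
> // plan and execute through the Spark SQL engine.
> val people = Seq(Person("Ann", 35), Person("Bob", 17)).toDS()
> val adultNames = people.filter(_.age >= 18).map(_.name)
> adultNames.show()
> {code}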
> h2. Requirements
> - *Fast* - In most cases, the performance of Datasets should be equal to or
> better than working with RDDs. Encoders should be as fast or faster than
> Kryo and Java serialization, and unnecessary conversion should be avoided.
> - *Typesafe* - Similar to RDDs, objects and functions that operate on those
> objects should provide compile-time safety where possible. When converting
> from data where the schema is not known at compile time (for example, data
> read from an external source such as JSON), the conversion function should
> fail fast if there is a schema mismatch (see the sketch after this list).
> - *Support for a variety of object models* - Default encoders should be
> provided for a variety of object models: primitive types, case classes,
> tuples, POJOs, JavaBeans, etc. Ideally, objects that follow standard
> conventions, such as Avro SpecificRecords, should also work out of the box
> (see the encoder sketch at the end of this description).
> - *Java Compatible* - Datasets should provide a single API that works in
> both Scala and Java. Where possible, shared types like Array will be used in
> the API. Where not possible, overloaded functions should be provided for
> both languages. Scala concepts, such as ClassTags, should not be required in
> the user-facing API.
> - *Interoperates with DataFrames* - Users should be able to seamlessly
> transition between Datasets and DataFrames, without writing conversion
> boilerplate, as sketched below. When names in the input schema line up with
> fields in the given class, no extra mapping should be necessary. Libraries
> like MLlib should not need to provide different interfaces for accepting
> DataFrames and Datasets as input.
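> A minimal sketch of the type-safety and DataFrame-interop points above
> (assuming the Spark 2.x API; the {{people.json}} path and the {{Person}}
> class are hypothetical):
> {code:scala}
> import org.apache.spark.sql.SparkSession
>
> case class Person(name: String, age: Long)  // hypothetical domain class
>
> val spark = SparkSession.builder().master("local[*]").getOrCreate()
> import spark.implicits._
>
> // The JSON schema is unknown at compile time; .as[Person] checks it at
> // analysis time and fails fast if columns do not line up with Person.
> val people = spark.read.json("people.json").as[Person]
>
> // Seamless round-trip: view a Dataset as a DataFrame (Dataset[Row]) and
> // back, with no hand-written conversion code.
> val df = people.toDF()
> val ds = df.as[Person]
> {code}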
> For a detailed outline of the complete proposed API:
> [marmbrus/dataset-api|https://github.com/marmbrus/spark/pull/18/files]
> For an initial discussion of the design considerations in this API: [design
> doc|https://docs.google.com/document/d/1ZVaDqOcLm2-NcS0TElmslHLsEIEwqzt0vBvzpLrV6Ik/edit#]
> The initial version of the Dataset API has been merged in Spark 1.6. However,
> it will take a few more releases to flesh everything out.
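> Relatedly, a sketch of how encoders for the different object models above
> could be obtained explicitly (using the {{org.apache.spark.sql.Encoders}}
> factory; the classes shown are hypothetical):
> {code:scala}
> import org.apache.spark.sql.Encoders
>
> case class Point(x: Double, y: Double)   // Scala case class
> class LegacyShape extends Serializable   // hypothetical arbitrary class
>
> // Built-in encoders for standard object models avoid generic serializers.
> val caseClassEnc = Encoders.product[Point]
> val tupleEnc     = Encoders.tuple(Encoders.STRING, Encoders.scalaLong)
> val primitiveEnc = Encoders.scalaInt
>
> // Fallback for arbitrary classes: a Kryo-based encoder (opaque binary).
> val kryoEnc = Encoders.kryo[LegacyShape]
> {code}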