[jira] [Updated] (SPARK-9999) Dataset API on top of Catalyst/DataFrame

Reynold Xin (JIRA) Tue, 03 Nov 2015 07:32:30 -0800

     [ https://issues.apache.org/jira/browse/SPARK-9999?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Reynold Xin updated SPARK-9999:
-------------------------------
    Description: 
The RDD API is very flexible, and as a result its execution is harder to 
optimize in some cases. The DataFrame API, on the other hand, is much easier to 
optimize, but lacks some of the nice perks of the RDD API (e.g. UDFs are harder 
to use, and there are no strong types in Scala/Java).

The goal of Spark Datasets is to provide an API that allows users to easily 
express transformations on domain objects, while also providing the performance 
and robustness advantages of the Spark SQL execution engine.

h2. Requirements
 - *Fast* - In most cases, the performance of Datasets should be equal to or 
better than working with RDDs.  Encoders should be as fast as or faster than 
Kryo and Java serialization, and unnecessary conversions should be avoided.
 - *Typesafe* - Similar to RDDs, objects and functions that operate on those 
objects should provide compile-time safety where possible.  When converting 
from data where the schema is not known at compile time (for example, data read 
from an external source such as JSON), the conversion function should fail fast 
if there is a schema mismatch.
 - *Support for a variety of object models* - Default encoders should be 
provided for a variety of object models: primitive types, case classes, tuples, 
POJOs, JavaBeans, etc.  Ideally, objects that follow standard conventions, such 
as Avro SpecificRecords, should also work out of the box.
 - *Java Compatible* - Datasets should provide a single API that works in both 
Scala and Java.  Where possible, shared types like Array will be used in the 
API.  Where not possible, overloaded functions should be provided for both 
languages.  Scala concepts, such as ClassTags, should not be required in the 
user-facing API.
 - *Interoperates with DataFrames* - Users should be able to seamlessly 
transition between Datasets and DataFrames, without specifying conversion 
boilerplate.  When names used in the input schema line up with fields in the 
given class, no extra mapping should be necessary.  Libraries like MLlib should 
not need to provide different interfaces for accepting DataFrames and Datasets 
as input.
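
The fail-fast rule in the *Typesafe* requirement, together with the name-based 
matching in the *Interoperates with DataFrames* requirement, can be sketched 
with a toy model in plain Java. This is only an illustration under stated 
assumptions: FailFastBinding, ToyEncoder, SchemaMismatchException, Person, 
bind and personEncoder are all hypothetical names invented for this sketch and 
are not part of Spark or of the proposed API. The point it shows is that the 
runtime schema is checked once, by column name and type, before any row is 
decoded into a typed object.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

// Toy model only: every name below is hypothetical and none is a Spark API.
final class FailFastBinding {

    /** Raised when the runtime schema cannot be bound to the target class. */
    static final class SchemaMismatchException extends RuntimeException {
        SchemaMismatchException(String message) {
            super(message);
        }
    }

    /** Domain object whose field types are known at compile time. */
    static final class Person {
        final String name;
        final long age;

        Person(String name, long age) {
            this.name = name;
            this.age = age;
        }
    }

    /** Hypothetical stand-in for an encoder: expected columns plus a row decoder. */
    static final class ToyEncoder<T> {
        final Map<String, Class<?>> expected;
        final Function<Map<String, Object>, T> decode;

        ToyEncoder(Map<String, Class<?>> expected,
                   Function<Map<String, Object>, T> decode) {
            this.expected = expected;
            this.decode = decode;
        }

        /** Checks the schema once; extra columns in the input are ignored. */
        void validate(Map<String, Class<?>> actual) {
            for (Map.Entry<String, Class<?>> field : expected.entrySet()) {
                Class<?> found = actual.get(field.getKey());
                if (found == null) {
                    throw new SchemaMismatchException(
                            "missing column: " + field.getKey());
                }
                if (!field.getValue().equals(found)) {
                    throw new SchemaMismatchException(
                            "column " + field.getKey() + " has type "
                                    + found.getSimpleName() + ", expected "
                                    + field.getValue().getSimpleName());
                }
            }
        }
    }

    /** Encoder for Person: column names match the field names, so no mapping is needed. */
    static ToyEncoder<Person> personEncoder() {
        Map<String, Class<?>> expected = new LinkedHashMap<>();
        expected.put("name", String.class);
        expected.put("age", Long.class);
        return new ToyEncoder<Person>(expected,
                row -> new Person((String) row.get("name"), (Long) row.get("age")));
    }

    /** Binds untyped rows to typed objects, validating the schema first. */
    static <T> List<T> bind(Map<String, Class<?>> schema,
                            List<Map<String, Object>> rows,
                            ToyEncoder<T> encoder) {
        encoder.validate(schema); // fail here, before touching any row
        List<T> out = new ArrayList<>();
        for (Map<String, Object> row : rows) {
            out.add(encoder.decode.apply(row));
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, Class<?>> schema = new LinkedHashMap<>();
        schema.put("name", String.class);
        schema.put("age", String.class); // wrong type on purpose

        Map<String, Object> row = new LinkedHashMap<>();
        row.put("name", "Ada");
        row.put("age", "36");
        List<Map<String, Object>> rows = new ArrayList<>();
        rows.add(row);

        try {
            bind(schema, rows, personEncoder());
        } catch (SchemaMismatchException e) {
            System.out.println("rejected before decoding: " + e.getMessage());
        }
    }
}
```

Validating up front, rather than letting a cast fail on some later row, is the 
design choice this sketch is meant to highlight: a mismatch surfaces at the 
conversion call instead of deep inside a running job.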

For a detailed outline of the complete proposed API: 
[marmbrus/dataset-api|https://github.com/marmbrus/spark/pull/18/files]
For an initial discussion of the design considerations in this API: [design 
doc|https://docs.google.com/document/d/1ZVaDqOcLm2-NcS0TElmslHLsEIEwqzt0vBvzpLrV6Ik/edit#]

The initial version of the Dataset API has been merged into Spark 1.6. However, 
it will take a few more releases to flesh everything out.

> Dataset API on top of Catalyst/DataFrame
> ----------------------------------------
>
>                 Key: SPARK-9999
>                 URL: https://issues.apache.org/jira/browse/SPARK-9999
>             Project: Spark
>          Issue Type: Story
>          Components: SQL
>            Reporter: Reynold Xin
>            Assignee: Michael Armbrust



