[jira] [Commented] (SPARK-9999) Dataset API on top of Catalyst/DataFrame

Nicholas Chammas (JIRA) Mon, 23 Nov 2015 11:04:47 -0800

    [ 
https://issues.apache.org/jira/browse/SPARK-9999?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15022735#comment-15022735
 ]


Nicholas Chammas commented on SPARK-9999:
-----------------------------------------

[~sandyr] - Hmm, so are you saying that, generally speaking, Datasets will 
provide no performance advantages over DataFrames, and that they will just help 
in terms of catching type errors early?

{quote}
Python and R are dynamically typed so can't take advantage of these.
{quote}

I can't speak for R, but Python as supported type hints since 3.0. More 
recently, Python 3.5 introduced a [typing 
module|https://docs.python.org/3/library/typing.html#module-typing] to 
standardize how type hints are specified, which facilitates the use of static 
type checkers like [mypy|http://mypy-lang.org/]. PySpark could definitely offer 
a statically type checked API, but practically speaking it would have to be 
limited to Python 3+.

I suppose people don't generally expect static type checking when they use 
Python, so perhaps it makes sense not to support Datasets in PySpark.

> Dataset API on top of Catalyst/DataFrame
> ----------------------------------------
>
>                 Key: SPARK-9999
>                 URL: https://issues.apache.org/jira/browse/SPARK-9999
>             Project: Spark
>          Issue Type: Story
>          Components: SQL
>            Reporter: Reynold Xin
>            Assignee: Michael Armbrust
>
> The RDD API is very flexible, and as a result harder to optimize its 
> execution in some cases. The DataFrame API, on the other hand, is much easier 
> to optimize, but lacks some of the nice perks of the RDD API (e.g. harder to 
> use UDFs, lack of strong types in Scala/Java).
> The goal of Spark Datasets is to provide an API that allows users to easily 
> express transformations on domain objects, while also providing the 
> performance and robustness advantages of the Spark SQL execution engine.
> h2. Requirements
>  - *Fast* - In most cases, the performance of Datasets should be equal to or 
> better than working with RDDs.  Encoders should be as fast or faster than 
> Kryo and Java serialization, and unnecessary conversion should be avoided.
>  - *Typesafe* - Similar to RDDs, objects and functions that operate on those 
> objects should provide compile-time safety where possible.  When converting 
> from data where the schema is not known at compile-time (for example data 
> read from an external source such as JSON), the conversion function should 
> fail-fast if there is a schema mismatch.
>  - *Support for a variety of object models* - Default encoders should be 
> provided for a variety of object models: primitive types, case classes, 
> tuples, POJOs, JavaBeans, etc.  Ideally, objects that follow standard 
> conventions, such as Avro SpecificRecords, should also work out of the box.
>  - *Java Compatible* - Datasets should provide a single API that works in 
> both Scala and Java.  Where possible, shared types like Array will be used in 
> the API.  Where not possible, overloaded functions should be provided for 
> both languages.  Scala concepts, such as ClassTags should not be required in 
> the user-facing API.
>  - *Interoperates with DataFrames* - Users should be able to seamlessly 
> transition between Datasets and DataFrames, without specifying conversion 
> boiler-plate.  When names used in the input schema line-up with fields in the 
> given class, no extra mapping should be necessary.  Libraries like MLlib 
> should not need to provide different interfaces for accepting DataFrames and 
> Datasets as input.
> For a detailed outline of the complete proposed API: 
> [marmbrus/dataset-api|https://github.com/marmbrus/spark/pull/18/files]
> For an initial discussion of the design considerations in this API: [design 
> doc|https://docs.google.com/document/d/1ZVaDqOcLm2-NcS0TElmslHLsEIEwqzt0vBvzpLrV6Ik/edit#]
> The initial version of the Dataset API has been merged in Spark 1.6. However, 
> it will take a few more future releases to flush everything out.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Commented] (SPARK-9999) Dataset API on top of Catalyst/DataFrame

Reply via email to