GitHub user marmbrus opened a pull request:
https://github.com/apache/spark/pull/9190
[SPARK-11116] [SQL] First Draft of Dataset API
*This PR adds a new experimental API to Spark, tentatively named Datasets.*
A `Dataset` is a strongly typed collection of objects that can be
transformed in parallel using functional or relational operations. Example
usage is as follows:
### Functional
```scala
scala> val ds: Dataset[Int] = Seq(1, 2, 3).toDS()
scala> ds.filter(_ % 1 == 0).collect()
res1: Array[Int] = Array(1, 2, 3)
```
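The compatibility note below also mentions functional operations such as `map` and `flatMap`. A hypothetical continuation of the same REPL session might look like the following (whether these exact signatures are present in this first draft is an assumption):

```scala
scala> ds.map(_ + 1).collect()
res2: Array[Int] = Array(2, 3, 4)

scala> ds.flatMap(i => Seq(i, i * 10)).collect()
res3: Array[Int] = Array(1, 10, 2, 20, 3, 30)
```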
### Relational
```scala
scala> ds.toDF().show()
+-----+
|value|
+-----+
| 1|
| 2|
| 3|
+-----+
scala> ds.select(expr("value + 1").as[Int]).collect()
res11: Array[Int] = Array(2, 3, 4)
```
## Comparison to RDDs
A `Dataset` differs from an `RDD` in the following ways:
- The creation of a `Dataset` requires the presence of an explicit `Encoder` that can be used to serialize the object into a binary format. Encoders are also capable of mapping the schema of a given object to the Spark SQL type system. In contrast, RDDs rely on runtime reflection-based serialization.
- Internally, a `Dataset` is represented by a Catalyst logical plan, and the data is stored in the encoded form. This representation allows for additional logical operations and enables many operations (sorting, shuffling, etc.) to be performed without deserializing to an object.

A `Dataset` can be converted to an `RDD` by calling the `.rdd` method, as sketched below.
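As a minimal sketch, reusing the `ds` from the examples above (the deserialization behavior noted in the comments follows the bullet points; everything else here is standard Spark):

```scala
// Assumes the same session as above, where `ds: Dataset[Int]`
// was built via `Seq(1, 2, 3).toDS()`.
import org.apache.spark.rdd.RDD

// Dropping to an RDD leaves the typed Dataset world; per the bullets
// above, this requires deserializing the encoded data back into
// JVM objects.
val rdd: RDD[Int] = ds.rdd
rdd.map(_ * 2).collect()   // Array(2, 4, 6): plain RDD operations from here on
```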
## Comparison to DataFrames
A `Dataset` can be thought of as a specialized DataFrame, where the elements map to a specific JVM object type, instead of to a generic `Row` container. A DataFrame can be transformed into a specific Dataset by calling `df.as[ElementType]`. Similarly, you can transform a strongly-typed `Dataset` into a generic DataFrame by calling `ds.toDF()`, as in the sketch below.
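A minimal sketch of that round trip, assuming a case class and the implicits used in the examples above (the `Person` type is hypothetical):

```scala
case class Person(name: String, age: Int)

// DataFrame -> Dataset: bind the generic rows to a JVM object type.
val df = Seq(Person("a", 1), Person("b", 2)).toDF()
val ds: Dataset[Person] = df.as[Person]

// Dataset -> DataFrame: erase the element type back to generic Rows.
val df2 = ds.toDF()
```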
## Implementation Status and TODOs
This is a rough cut at the least controversial parts of the API. The primary purpose here is to get something committed so that we can better parallelize further work and get early feedback on the API. The following is being deferred to future PRs:
- Joins and Aggregations (prototype here
https://github.com/apache/spark/commit/f11f91e6f08c8cf389b8388b626cd29eec32d937)
- Support for Java

Additionally, binding an encoder to a given schema is currently done in a fairly ad-hoc fashion. This is an internal detail, and what we are doing today works for the cases we care about. However, as we add more APIs we'll probably need to do this in a more principled way (i.e. separate resolution from binding, as we do in DataFrames).
## COMPATIBILITY NOTE
Long term we plan to make `DataFrame` extend `Dataset[Row]`. However, making this change to the class hierarchy would break the function signatures for the existing functional operations (map, flatMap, etc.). As such, this class should be considered a preview of the final API. Changes will be made to the interface after Spark 1.6.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/marmbrus/spark dataset-infra
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/9190.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #9190
----
commit 0db295c3491b2344213647c91b7ecae2763c58ac
Author: Michael Armbrust <[email protected]>
Date: 2015-10-20T21:18:53Z
[SPARK-11116] [SQL] First Draft of Dataset API
commit f11f91e6f08c8cf389b8388b626cd29eec32d937
Author: Michael Armbrust <[email protected]>
Date: 2015-10-20T21:38:58Z
delete controversial bits
commit e32885f50462c3c66ae049c7cf0d69e90ecbedbe
Author: Michael Armbrust <[email protected]>
Date: 2015-10-20T23:02:28Z
style
----