GitHub user marmbrus opened a pull request:
https://github.com/apache/spark/pull/9190
[SPARK-11116] [SQL] First Draft of Dataset API
*This PR adds a new experimental API to Spark, tentatively named Datasets.*
A `Dataset` is a strongly typed collection of objects that can be
transformed in parallel using functional or relational operations. Example
usage is as follows:
### Functional
```scala
scala> val ds: Dataset[Int] = Seq(1, 2, 3).toDS()
scala> ds.filter(_ % 1 == 0).collect()
res1: Array[Int] = Array(1, 2, 3)
```
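The compatibility note below also mentions functional operations such as `map` and `flatMap`. A hypothetical continuation of the same REPL session might look like the following (whether these exact signatures are present in this first draft is an assumption):

```scala
scala> ds.map(_ + 1).collect()
res2: Array[Int] = Array(2, 3, 4)

scala> ds.flatMap(i => Seq(i, i * 10)).collect()
res3: Array[Int] = Array(1, 10, 2, 20, 3, 30)
```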
### Relational
```scala
scala> ds.toDF().show()
+-----+
|value|
+-----+
| 1|
| 2|
| 3|
+-----+
scala> ds.select(expr("value + 1").as[Int]).collect()
res11: Array[Int] = Array(2, 3, 4)
```
## Comparison to RDDs
A `Dataset` differs from an `RDD` in the following ways:
- The creation of a `Dataset` requires the presence of an explicit `Encoder` that can be used to serialize the object into a binary format. Encoders are also capable of mapping the schema of a given object to the Spark SQL type system. In contrast, RDDs rely on runtime reflection-based serialization.
- Internally, a `Dataset` is represented by a Catalyst logical plan, and the data is stored in the encoded form. This representation allows for additional logical operations and enables many operations (sorting, shuffling, etc.) to be performed without deserializing to an object.

A `Dataset` can be converted to an `RDD` by calling the `.rdd` method, as sketched below.
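As a minimal sketch, reusing the `ds` from the examples above (the deserialization behavior noted in the comments follows the bullet points; everything else here is standard Spark):

```scala
// Assumes the same session as above, where `ds: Dataset[Int]`
// was built via `Seq(1, 2, 3).toDS()`.
import org.apache.spark.rdd.RDD

// Dropping to an RDD leaves the typed Dataset world; per the bullets
// above, this requires deserializing the encoded data back into
// JVM objects.
val rdd: RDD[Int] = ds.rdd
rdd.map(_ * 2).collect()   // Array(2, 4, 6): plain RDD operations from here on
```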
## Comparison to DataFrames
A `Dataset` can be thought of as a specialized DataFrame, where the elements map to a specific JVM object type, instead of to a generic `Row` container. A DataFrame can be transformed into a specific Dataset by calling `df.as[ElementType]`. Similarly, you can transform a strongly-typed `Dataset` into a generic DataFrame by calling `ds.toDF()`, as in the sketch below.
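A minimal sketch of that round trip, assuming a case class and the implicits used in the examples above (the `Person` type is hypothetical):

```scala
case class Person(name: String, age: Int)

// DataFrame -> Dataset: bind the generic rows to a JVM object type.
val df = Seq(Person("a", 1), Person("b", 2)).toDF()
val ds: Dataset[Person] = df.as[Person]

// Dataset -> DataFrame: erase the element type back to generic Rows.
val df2 = ds.toDF()
```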
## Implementation Status and TODOs
This is a rough cut at the least controversial parts of the API. The primary purpose here is to get something committed so that we can better parallelize further work and get early feedback on the API. The following is being deferred to future PRs:
- Joins and Aggregations (prototype here
https://github.com/apache/spark/commit/f11f91e6f08c8cf389b8388b626cd29eec32d937)
- Support for Java

Additionally, binding an encoder to a given schema is currently done in a fairly ad-hoc fashion. This is an internal detail, and what we are doing today works for the cases we care about. However, as we add more APIs we'll probably need to do this in a more principled way (i.e. separate resolution from binding, as we do in DataFrames).
## COMPATIBILITY NOTE
Long term we plan to make `DataFrame` extend `Dataset[Row]`. However, making this change to the class hierarchy would break the function signatures for the existing functional operations (map, flatMap, etc.). As such, this class should be considered a preview of the final API. Changes will be made to the interface after Spark 1.6.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/marmbrus/spark dataset-infra
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/9190.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #9190
----
commit 0db295c3491b2344213647c91b7ecae2763c58ac
Author: Michael Armbrust <[email protected]>
Date: 2015-10-20T21:18:53Z
[SPARK-11116] [SQL] First Draft of Dataset API
commit f11f91e6f08c8cf389b8388b626cd29eec32d937
Author: Michael Armbrust <[email protected]>
Date: 2015-10-20T21:38:58Z
delete controversial bits
commit e32885f50462c3c66ae049c7cf0d69e90ecbedbe
Author: Michael Armbrust <[email protected]>
Date: 2015-10-20T23:02:28Z
style
----