GitHub user cloud-fan opened a pull request:
https://github.com/apache/spark/pull/11347
[SPARK-13233][SQL][WIP] Python Dataset (basic version)
## What changes were proposed in this pull request?
This PR introduces a new API: Python Dataset. Conceptually, it is a
combination of Python DataFrame and Python RDD that supports both typed
operations (e.g. map, flatMap, filter) and untyped operations (e.g. select,
sort). This is a simpler version of
https://github.com/apache/spark/pull/11117, without the aggregate part.
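To make the typed/untyped distinction concrete, here is a minimal pure-Python sketch. It is not the code from this PR and does not use Spark: `ToyDataset`, `Person`, and the exact method signatures are hypothetical names chosen for illustration. It only shows how function-based (typed) operations and name-based (untyped) operations could sit on one object.

```python
from collections import namedtuple


class ToyDataset:
    """Hypothetical in-memory stand-in; NOT the Dataset class from this PR."""

    def __init__(self, rows):
        self._rows = list(rows)

    # Typed operations: the caller passes a Python function over whole records.
    def map(self, f):
        return ToyDataset(f(r) for r in self._rows)

    def flatMap(self, f):
        return ToyDataset(x for r in self._rows for x in f(r))

    def filter(self, f):
        return ToyDataset(r for r in self._rows if f(r))

    # Untyped operations: the caller refers to fields by name.
    def select(self, *cols):
        Row = namedtuple("Row", cols)
        return ToyDataset(Row(*(getattr(r, c) for c in cols)) for r in self._rows)

    def sort(self, col):
        return ToyDataset(sorted(self._rows, key=lambda r: getattr(r, col)))

    def collect(self):
        return list(self._rows)


# Illustrative data (assumed, not taken from the PR's tests).
Person = namedtuple("Person", ["name", "age"])
ds = ToyDataset([Person("Alice", 30), Person("Bob", 17), Person("Carol", 25)])

# Mix both styles in one pipeline: typed filter, untyped sort/select, typed map.
adults = ds.filter(lambda p: p.age >= 18)
names = adults.sort("age").select("name").map(lambda r: r.name.upper()).collect()
print(names)
```

In this sketch the typed calls receive whole records and arbitrary Python logic, while the untyped calls only need field names, which is the property that would let an engine such as Spark SQL plan them without running user code.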
## How was this patch tested?
New tests are added in pyspark/sql/tests.py.
TODO:
* add documentation
* add more tests
* fix all corner cases
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/cloud-fan/spark pydataset
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/11347.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #11347
----
commit ddfdbbdf9c052f078dd914dfa6ae54de6c633d46
Author: Wenchen Fan <[email protected]>
Date: 2016-02-24T09:19:11Z
tmp
commit fb0e7f497538390e631d16c9d27ef3c03e4e4b8a
Author: Wenchen Fan <[email protected]>
Date: 2016-02-24T13:58:52Z
python dataset
commit a073f831adfd7640c40f9455ddcc02b9667db9ad
Author: Wenchen Fan <[email protected]>
Date: 2016-02-24T14:19:25Z
update
----