GitHub user cloud-fan opened a pull request:

    https://github.com/apache/spark/pull/11347

    [SPARK-13233][SQL][WIP] Python Dataset (basic version)

    ## What changes were proposed in this pull request?
    
    This PR introduces a new API: Python Dataset. Conceptually it is a 
combination of Python DataFrame and Python RDD that supports both typed 
operations (e.g. map, flatMap, filter) and untyped operations (e.g. select, 
sort). This is a simpler version of 
https://github.com/apache/spark/pull/11117, without the aggregate part.
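
To make the typed/untyped distinction concrete, here is a minimal pure-Python sketch. It is not the API added by this PR: `ToyDataset` and all of its methods are hypothetical names invented for illustration, and the whole thing runs in memory without Spark. Typed operations take arbitrary Python functions over whole records, while untyped operations refer to columns by name.

```python
from typing import Any, Callable, Iterable, List


class ToyDataset:
    """Hypothetical in-memory stand-in; NOT the class introduced by this PR."""

    def __init__(self, rows: Iterable[Any]):
        self._rows: List[Any] = list(rows)

    # Typed operations: apply a Python function to each record (RDD-like).
    def map(self, f: Callable[[Any], Any]) -> "ToyDataset":
        return ToyDataset(f(r) for r in self._rows)

    def flatMap(self, f: Callable[[Any], Iterable[Any]]) -> "ToyDataset":
        return ToyDataset(x for r in self._rows for x in f(r))

    def filter(self, f: Callable[[Any], bool]) -> "ToyDataset":
        return ToyDataset(r for r in self._rows if f(r))

    # Untyped operations: address columns by name (DataFrame-like).
    # These assume every record is a dict keyed by column name.
    def select(self, *cols: str) -> "ToyDataset":
        return ToyDataset({c: r[c] for c in cols} for r in self._rows)

    def sort(self, col: str, ascending: bool = True) -> "ToyDataset":
        ordered = sorted(self._rows, key=lambda r: r[col], reverse=not ascending)
        return ToyDataset(ordered)

    def collect(self) -> List[Any]:
        return list(self._rows)


people = ToyDataset([
    {"name": "a", "age": 30},
    {"name": "b", "age": 17},
    {"name": "c", "age": 45},
])

# Typed step: plain Python lambdas over whole records.
adults = (people
          .filter(lambda r: r["age"] >= 18)
          .map(lambda r: {"name": r["name"].upper(), "age": r["age"]}))

# Untyped step: columns referenced by name.
result = adults.sort("age", ascending=False).select("name").collect()
print(result)
```

In this toy, the two styles chain freely because both return the same wrapper type, which is the property the description attributes to the proposed Dataset.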
    
    
    ## How was this patch tested?
    
    New tests are added in pyspark/sql/tests.py.
    
    
    TODO:
    
    * add documentation
    * add more tests
    * fix all corner cases


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/cloud-fan/spark pydataset

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/11347.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #11347
    
----
commit ddfdbbdf9c052f078dd914dfa6ae54de6c633d46
Author: Wenchen Fan <[email protected]>
Date:   2016-02-24T09:19:11Z

    tmp

commit fb0e7f497538390e631d16c9d27ef3c03e4e4b8a
Author: Wenchen Fan <[email protected]>
Date:   2016-02-24T13:58:52Z

    python dataset

commit a073f831adfd7640c40f9455ddcc02b9667db9ad
Author: Wenchen Fan <[email protected]>
Date:   2016-02-24T14:19:25Z

    update

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [email protected] or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
