[GitHub] spark pull request #15821: [SPARK-13534][WIP][PySpark] Using Apache Arrow to...

BryanCutler Tue, 08 Nov 2016 17:14:07 -0800

GitHub user BryanCutler opened a pull request:

    https://github.com/apache/spark/pull/15821


    [SPARK-13534][WIP][PySpark] Using Apache Arrow to increase performance of DataFrame.toPandas

    ## What changes were proposed in this pull request?
    WIP to integrate Apache Arrow with Spark to increase performance of DataFrame.toPandas
    
    ## How was this patch tested?
    Added new unit tests, which run only if PyArrow is installed
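
A minimal sketch of how a test can be made conditional on PyArrow being installed, using only the standard `unittest` module. This is an illustration, not the code from the pull request: the class name `ArrowToPandasTests`, the flag `HAVE_PYARROW`, and the helper `run_suite` are hypothetical.

```python
import importlib.util
import unittest

# Assumption: availability is detected by looking the module up,
# without importing it at module load time.
HAVE_PYARROW = importlib.util.find_spec("pyarrow") is not None


@unittest.skipIf(not HAVE_PYARROW, "PyArrow is not installed")
class ArrowToPandasTests(unittest.TestCase):  # hypothetical test class
    def test_pyarrow_importable(self):
        # Imported inside the test so this file still loads without PyArrow.
        import pyarrow
        self.assertTrue(hasattr(pyarrow, "__version__"))


def run_suite():
    """Run the test class and return the unittest result object."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(ArrowToPandasTests)
    return unittest.TextTestRunner(verbosity=0).run(suite)
```

When PyArrow is missing, the test is reported as skipped rather than failed, so the rest of the suite is unaffected.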

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/BryanCutler/spark wip-toPandas_with_arrow-SPARK-13534

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/15821.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #15821
    
----
commit 155c26db728391a45f8d890413c1393cd9655d82
Author: Bryan Cutler <[email protected]>
Date:   2016-10-24T17:54:16Z

    Added Arrow 0.1.0 as dependency to Spark-SQL and stub for collectAsArrow()

commit d556970429355b92844c12446406461fc6ae3ca4
Author: Bryan Cutler <[email protected]>
Date:   2016-10-24T18:28:19Z

    added function collectAsArrowToPython for python API

commit a6171d7b0ea83b39025f6febfbaa0688618b678f
Author: Bryan Cutler <[email protected]>
Date:   2016-10-26T19:27:56Z

    Changed conversion to use ArrowRecordBatch in Scala

commit 54ac65b397076d63afd201765578372319309adf
Author: Bryan Cutler <[email protected]>
Date:   2016-10-26T21:39:17Z

    added framework for PySpark toPandas using pyarrow, untested

commit 4f9943ff10fcd82640661ea5f29f3f908b3cf1b0
Author: Xusen Yin <[email protected]>
Date:   2016-10-27T04:49:09Z

    prototype code to reuse serveIterator and _load_from_socket

commit da5ea5b4ebda86d40167cb03e7f8fb3f744cf2f7
Author: Bryan Cutler <[email protected]>
Date:   2016-10-27T06:07:44Z

    Merge pull request #3 from yinxusen/wip-toPandas_with_arrow-SPARK-13534
    
    reusable collect

commit 89f87c17520470dfb1e25b73b21e6f355627deec
Author: Bryan Cutler <[email protected]>
Date:   2016-10-31T18:02:16Z

    Fixed compilation error, now using case class instead of tuple in iterator

commit 20b5aa0f07037c06cbf1bda1a1940d960e145fb2
Author: Bryan Cutler <[email protected]>
Date:   2016-10-31T18:24:14Z

    removed unused imports

commit 3c112f6d3dbe2a4c5b76b04bc57999484b5959ac
Author: Bryan Cutler <[email protected]>
Date:   2016-10-31T23:38:25Z

    added dependency exclusions for arrow-vector, uses conflicting jackson lib

commit 06b19caff4e0174c92501ea0942bffd8b63ea3a2
Author: Bryan Cutler <[email protected]>
Date:   2016-11-01T00:28:07Z

    started adding roundtrip test using DataFrame.toPandas

commit 1657928d9878e6f8769adc5d50802053eb660af5
Author: Bryan Cutler <[email protected]>
Date:   2016-11-01T00:31:56Z

    fixed some minor issues with toPandas pipeline

commit 842b620a19e3f51e6c9f3bdaaa69e3a5bf0f382b
Author: Bryan Cutler <[email protected]>
Date:   2016-11-01T00:58:19Z

    fixed spacing

commit 795684e9f4e8fd0458fd4a056e2d68417fb37306
Author: Bryan Cutler <[email protected]>
Date:   2016-11-01T19:46:04Z

    Need to pass instance of ArrowSerializer to _load_from_socket

commit 39b4f02c764c14bea46ca91ba3c2e0143543b53c
Author: Bryan Cutler <[email protected]>
Date:   2016-11-03T23:55:52Z

    Simplified by serving byte array to PythonRDD

commit f4ed56fdeafe4805d8541f36aa876c3c4c530436
Author: Bryan Cutler <[email protected]>
Date:   2016-11-04T00:51:26Z

    added in hard-coded test batch, not quite working

commit af63fc263f4a6a5dd59d7b3b8d8ccee44fd67cc9
Author: Bryan Cutler <[email protected]>
Date:   2016-11-08T01:02:13Z

    pyspark df had incorrect method name for ArrowRecordBatch.to_pandas()

commit 8180a95f1178ace76643fd5e57302185bb8b730e
Author: Bryan Cutler <[email protected]>
Date:   2016-11-08T01:04:23Z

    added test to exercise the Python API of Arrow by sending a bytearray through Spark DataFrame without serialization

commit 35a32c2bda210cd3aeeb207bcb9dc55b112c17ae
Author: Bryan Cutler <[email protected]>
Date:   2016-11-09T01:02:22Z

    moved pyarrow import, since it will most likely not be a required dependency.  some minor cleanup

commit 275602c9fe83e1eb595ca7ae70dcc340b99a8916
Author: Bryan Cutler <[email protected]>
Date:   2016-11-09T01:07:43Z

    accidentally removed newlines in import

commit a35a441ce01ec1cc95d384b6a008780a61768b08
Author: Bryan Cutler <[email protected]>
Date:   2016-11-09T01:08:48Z

    Arrow no longer a dependency in PythonRDD

commit 4227ec6696af2289c47eb0bd0f7872d50f69d302
Author: Bryan Cutler <[email protected]>
Date:   2016-11-09T01:12:03Z

    minor cleanup of test case

----
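
One of the commits above moves the pyarrow import because it will most likely not be a required dependency. A minimal sketch of that general pattern follows, assuming the import is deferred until the feature is actually used. The helper name `require_optional` and its error message are hypothetical and are not taken from the pull request.

```python
import importlib


def require_optional(module_name, feature):
    """Import an optional dependency on first use, with a clear error if absent.

    Hypothetical helper: both the name and the message are illustrative.
    """
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        raise ImportError(
            f"{feature} requires the optional package '{module_name}'"
        ) from exc
```

A caller would invoke `require_optional("pyarrow", "Arrow-based toPandas")` inside the function that needs Arrow, so that importing the surrounding module never fails when PyArrow is absent.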


