GitHub user BryanCutler opened a pull request:
https://github.com/apache/spark/pull/18459
[SPARK-13534][PYSPARK] Using Apache Arrow to increase performance of DataFrame.toPandas
## What changes were proposed in this pull request?
Integrate Apache Arrow with Spark to increase performance of
`DataFrame.toPandas`. This has been done by using Arrow to convert data
partitions on the executor JVM to Arrow payload byte arrays where they are then
served to the Python process. The Python DataFrame can then collect the Arrow
payloads, where they are combined and converted to a Pandas DataFrame. All data
types are currently supported except complex, date, timestamp, and decimal
types; for these, an `UnsupportedOperation` exception is thrown.
Additions to Spark include a Scala package-private method
`Dataset.toArrowPayload` that converts data partitions in the executor JVM
to `ArrowPayload`s as byte arrays so they can be easily served, and a
package-private class/object `ArrowConverters` that provides data type mappings
and conversion routines. In Python, a private method `DataFrame._collectAsArrow`
is added to collect Arrow payloads, and the SQLConf
"spark.sql.execution.arrow.enable" can be set to make `toPandas()` use Arrow
(the old conversion path is used by default).
## How was this patch tested?
Added a new test suite `ArrowConvertersSuite` that runs tests on the
conversion of Datasets to Arrow payloads for supported types. The suite
generates a Dataset and matching Arrow JSON data; the Dataset is then converted
to an Arrow payload and validated against the JSON data. This ensures
that the schema and data have been converted correctly.
Added PySpark tests to verify that the `toPandas` method produces equal
DataFrames with and without pyarrow, plus a roundtrip test to ensure the pandas
DataFrame produced by PySpark is equal to one made directly with pandas.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/BryanCutler/spark
toPandas_with_arrow-SPARK-13534
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/18459.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #18459
----
commit afd57398451d58aa37e92e2b5842e263c1e0705e
Author: Li Jin <[email protected]>
Date: 2016-12-12T21:50:51Z
Test suite prototyping for collectAsArrow
Changed scope of arrow-tools dependency to test
commented out lines to Integration.compareXX that are private to arrow
closes #10
commit f681d524f8f6986d2e05851814d67d2a3a858f0e
Author: Bryan Cutler <[email protected]>
Date: 2016-12-14T21:42:02Z
Initial attempt to integrate Arrow for use in dataframe.toPandas.
Conversion has basic data types and is working for small datasets with longs,
doubles. Using Arrow 0.1.1-SNAPSHOT dependency.
commit a4b958e6b149c0734f9c70de2defd8807a3e4972
Author: Li Jin <[email protected]>
Date: 2017-01-05T22:21:19Z
Test compiling against the newest arrow; Fix validity map; Add benchmark
script
Remove arrow-tools dependency
changed zipWithIndex to while loop
modified benchmark to work with Python2 timeit
closes #13
commit be508a587e25d51aaa755bd6c6e74795b5287645
Author: Li Jin <[email protected]>
Date: 2017-01-12T04:38:15Z
Fix conversion for String type; refactor related functions to Arrow.scala
changed tests to use existing SQLTestData and removed unused files
closes #14
commit 5dbad2241f318ce4926d2dc7446dbd81e092d8a3
Author: Bryan Cutler <[email protected]>
Date: 2017-01-12T22:40:32Z
Moved test data files to a sub-dir for arrow, merged dataType matching and
cleanup
closes #15
commit 5837b38e5f7a3d8e31c6f06e64c1c7139d40a46a
Author: Bryan Cutler <[email protected]>
Date: 2017-01-14T00:16:05Z
added some python unit tests
added more conversion tests
short type should have a bit-width of 16
closes #17
commit bdba357eef31b3225ea6a565e7841d3459822e97
Author: Li Jin <[email protected]>
Date: 2017-01-17T21:47:07Z
Implement Arrow column writers
Move column writers to Arrow.scala
Add support for more types; Switch to arrow NullableVector
closes #16
commit d20437f37253f565a4f2647a7a5768a525678db1
Author: Bryan Cutler <[email protected]>
Date: 2017-01-20T01:52:26Z
added bool type conversion test
added test for byte data
byte type should be signed
closes #18
commit 1ce4f2d30de03833b56e24d06c81d79322aca6ef
Author: Li Jin <[email protected]>
Date: 2017-01-23T23:06:16Z
Add support for date/timestamp/binary; Add more numbers to benchmark.py;
Fix memory leaking bug
closes #19
commit 2e81a93735d06d6fdbecd17747d85dcbe79d23bf
Author: Bryan Cutler <[email protected]>
Date: 2017-01-24T19:35:00Z
changed scope of some functions and minor cleanup
commit ed1f0fabaec1b367b593af1a000bf40876e7816c
Author: Bryan Cutler <[email protected]>
Date: 2017-01-24T22:20:13Z
Cleanup of changes before updating the PR for review
remove unwanted changes
removed benchmark.py from repository, will attach to PR instead
commit 202650ea6a7fb503bb375c531e1976bf480f4ed6
Author: Bryan Cutler <[email protected]>
Date: 2017-01-25T22:02:59Z
Changed RootAllocator param to Option in collectAsArrow
added more tests and cleanup
closes #20
commit fbe3b7ce06c1322306f2e3f3db6e77ec60620189
Author: Bryan Cutler <[email protected]>
Date: 2017-01-27T00:25:40Z
renamed to ArrowConverters
defined ArrowPayload and encapsulated Arrow classes in ArrowConverters
addressed some minor comments in code review
closes #21
commit f44e6d74d728e430878586e0eec99c5ac6017e4e
Author: Wes McKinney <[email protected]>
Date: 2017-01-30T22:36:23Z
Adjust to cleaned up pyarrow FileReader API, support multiple record
batches in a stream
closes #22
commit e0bf11b3b7e9669974090c8bfde5fd8a2a0629c0
Author: Bryan Cutler <[email protected]>
Date: 2017-02-03T22:53:58Z
changed conversion to use Iterator[InternalRow] instead of Array
arrow conversion done at partition by executors
some cleanup of APIs, made tests complete for non-complex data types
closes #23
commit 3090a3eeb0f8efb68187ee870c0fb3ea46b9f46e
Author: Bryan Cutler <[email protected]>
Date: 2017-02-22T01:44:57Z
Changed tests to use generated JSON data instead of files
commit 54884ed502ad779edf212bb5652afe46f28bd921
Author: Bryan Cutler <[email protected]>
Date: 2017-02-23T00:21:00Z
updated Arrow artifacts to 0.2.0 release
commit 42af1d59c6c8e6fbb0f38cd8d1b1afbe86076aa1
Author: Bryan Cutler <[email protected]>
Date: 2017-02-23T01:03:24Z
fixed python style checks
commit 9c8ea63ccec4ceb9dd40834bca530a6265385579
Author: Bryan Cutler <[email protected]>
Date: 2017-02-23T01:17:38Z
updated dependency manifest
commit b7c28ad19cc56553c8cdb53f98f92f189c9d7a27
Author: Bryan Cutler <[email protected]>
Date: 2017-02-24T00:29:30Z
test format fix for python 2.6
commit 2851cd6ea9e37a9fa439bc1a7510ba62ea90b110
Author: Bryan Cutler <[email protected]>
Date: 2017-02-28T22:03:57Z
fixed docstrings and added list of pyarrow supported types
commit f8f24abeb4fd3394a8aa4136b1dda6b6ae4cd323
Author: Bryan Cutler <[email protected]>
Date: 2017-03-03T22:38:03Z
fixed memory leak of ArrowRecordBatch iterator getting consumed and batches
not closed properly
commit b6c752b0fcc19f2c5795593d9b4a1511f574c83d
Author: Bryan Cutler <[email protected]>
Date: 2017-03-03T22:46:50Z
changed _collectAsArrow to private method
commit cbab294cbf291bd0082139e6ca988909b06bdd1a
Author: Bryan Cutler <[email protected]>
Date: 2017-03-03T23:07:03Z
added netty to exclusion list for arrow dependency
commit 44ca3ffc37711225a64fd57dca2ba747debf8c3d
Author: Bryan Cutler <[email protected]>
Date: 2017-03-07T18:21:46Z
dict comprehensions not supported in python 2.6
commit 33b75b99b375292e1f4691983e37919d0a620725
Author: Bryan Cutler <[email protected]>
Date: 2017-03-10T21:12:45Z
ensure payload batches are closed if any exception is thrown, some minor
cleanup
commit 97742b8bc39aa735af098bba16d5355651b87025
Author: Bryan Cutler <[email protected]>
Date: 2017-03-13T18:40:56Z
changed comment for readable seekable byte channel class
commit b821077b29daffffc5a931a89989fbd73fc90d68
Author: Li Jin <[email protected]>
Date: 2017-03-28T01:09:30Z
Remove Date and Timestamp from supported types
closes #24
commit 3d786a2e1b01a697678d1f4868b9e0354fe13333
Author: Bryan Cutler <[email protected]>
Date: 2017-04-03T21:22:44Z
Added scaladocs to methods that did not have it
commit cb4c510fc5c44ffe8bff0b4d57e0ca4e2af48a8a
Author: Bryan Cutler <[email protected]>
Date: 2017-04-03T22:01:19Z
added check for pyarrow import error
----