GitHub user BryanCutler opened a pull request:
https://github.com/apache/spark/pull/18459
[SPARK-13534][PYSPARK] Using Apache Arrow to increase performance of DataFrame.toPandas
## What changes were proposed in this pull request?
Integrate Apache Arrow with Spark to increase performance of
`DataFrame.toPandas`. This has been done by using Arrow to convert data
partitions on the executor JVM to Arrow payload byte arrays where they are then
served to the Python process. The Python DataFrame can then collect the Arrow
payloads, where they are combined and converted to a Pandas DataFrame. All data
types are currently supported except complex, date, timestamp, and decimal
types; for these, an `UnsupportedOperation` exception is thrown.
Additions to Spark include a Scala package-private method
`Dataset.toArrowPayload` that converts data partitions in the executor JVM
to `ArrowPayload`s as byte arrays so they can be easily served, and a
package-private class/object `ArrowConverters` that provides data type mappings
and conversion routines. In Python, a private method `DataFrame._collectAsArrow`
is added to collect Arrow payloads, and the SQLConf
"spark.sql.execution.arrow.enable" can be set to make `toPandas()` use Arrow
(the old conversion path is used by default).
## How was this patch tested?
Added a new test suite `ArrowConvertersSuite` that runs tests on the
conversion of Datasets to Arrow payloads for supported types. The suite
generates a Dataset and matching Arrow JSON data; the Dataset is then converted
to an Arrow payload and validated against the JSON data. This ensures
that the schema and data have been converted correctly.
Added PySpark tests to verify that the `toPandas` method produces equal
DataFrames with and without pyarrow, plus a roundtrip test to ensure the pandas
DataFrame produced by PySpark is equal to one made directly with pandas.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/BryanCutler/spark
toPandas_with_arrow-SPARK-13534
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/18459.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #18459
----
commit afd57398451d58aa37e92e2b5842e263c1e0705e
Author: Li Jin <[email protected]>
Date: 2016-12-12T21:50:51Z
Test suite prototyping for collectAsArrow
Changed scope of arrow-tools dependency to test
commented out lines to Integration.compareXX that are private to arrow
closes #10
commit f681d524f8f6986d2e05851814d67d2a3a858f0e
Author: Bryan Cutler <[email protected]>
Date: 2016-12-14T21:42:02Z
Initial attempt to integrate Arrow for use in dataframe.toPandas.
Conversion has basic data types and is working for small datasets with longs,
doubles. Using Arrow 0.1.1-SNAPSHOT dependency.
commit a4b958e6b149c0734f9c70de2defd8807a3e4972
Author: Li Jin <[email protected]>
Date: 2017-01-05T22:21:19Z
Test compiling against the newest arrow; Fix validity map; Add benchmark
script
Remove arrow-tools dependency
changed zipWithIndex to while loop
modified benchmark to work with Python2 timeit
closes #13
commit be508a587e25d51aaa755bd6c6e74795b5287645
Author: Li Jin <[email protected]>
Date: 2017-01-12T04:38:15Z
Fix conversion for String type; refactor related functions to Arrow.scala
changed tests to use existing SQLTestData and removed unused files
closes #14
commit 5dbad2241f318ce4926d2dc7446dbd81e092d8a3
Author: Bryan Cutler <[email protected]>
Date: 2017-01-12T22:40:32Z
Moved test data files to a sub-dir for arrow, merged dataType matching and
cleanup
closes #15
commit 5837b38e5f7a3d8e31c6f06e64c1c7139d40a46a
Author: Bryan Cutler <[email protected]>
Date: 2017-01-14T00:16:05Z
added some python unit tests
added more conversion tests
short type should have a bit-width of 16
closes #17
commit bdba357eef31b3225ea6a565e7841d3459822e97
Author: Li Jin <[email protected]>
Date: 2017-01-17T21:47:07Z
Implement Arrow column writers
Move column writers to Arrow.scala
Add support for more types; Switch to arrow NullableVector
closes #16
commit d20437f37253f565a4f2647a7a5768a525678db1
Author: Bryan Cutler <[email protected]>
Date: 2017-01-20T01:52:26Z
added bool type conversion test
added test for byte data
byte type should be signed
closes #18
commit 1ce4f2d30de03833b56e24d06c81d79322aca6ef
Author: Li Jin <[email protected]>
Date: 2017-01-23T23:06:16Z
Add support for date/timestamp/binary; Add more numbers to benchmark.py;
Fix memory leaking bug
closes #19
commit 2e81a93735d06d6fdbecd17747d85dcbe79d23bf
Author: Bryan Cutler <[email protected]>
Date: 2017-01-24T19:35:00Z
changed scope of some functions and minor cleanup
commit ed1f0fabaec1b367b593af1a000bf40876e7816c
Author: Bryan Cutler <[email protected]>
Date: 2017-01-24T22:20:13Z
Cleanup of changes before updating the PR for review
remove unwanted changes
removed benchmark.py from repository, will attach to PR instead
commit 202650ea6a7fb503bb375c531e1976bf480f4ed6
Author: Bryan Cutler <[email protected]>
Date: 2017-01-25T22:02:59Z
Changed RootAllocator param to Option in collectAsArrow
added more tests and cleanup
closes #20
commit fbe3b7ce06c1322306f2e3f3db6e77ec60620189
Author: Bryan Cutler <[email protected]>
Date: 2017-01-27T00:25:40Z
renamed to ArrowConverters
defined ArrowPayload and encapsulated Arrow classes in ArrowConverters
addressed some minor comments in code review
closes #21
commit f44e6d74d728e430878586e0eec99c5ac6017e4e
Author: Wes McKinney <[email protected]>
Date: 2017-01-30T22:36:23Z
Adjust to cleaned up pyarrow FileReader API, support multiple record
batches in a stream
closes #22
commit e0bf11b3b7e9669974090c8bfde5fd8a2a0629c0
Author: Bryan Cutler <[email protected]>
Date: 2017-02-03T22:53:58Z
changed conversion to use Iterator[InternalRow] instead of Array
arrow conversion done at partition by executors
some cleanup of APIs, made tests complete for non-complex data types
closes #23
commit 3090a3eeb0f8efb68187ee870c0fb3ea46b9f46e
Author: Bryan Cutler <[email protected]>
Date: 2017-02-22T01:44:57Z
Changed tests to use generated JSON data instead of files
commit 54884ed502ad779edf212bb5652afe46f28bd921
Author: Bryan Cutler <[email protected]>
Date: 2017-02-23T00:21:00Z
updated Arrow artifacts to 0.2.0 release
commit 42af1d59c6c8e6fbb0f38cd8d1b1afbe86076aa1
Author: Bryan Cutler <[email protected]>
Date: 2017-02-23T01:03:24Z
fixed python style checks
commit 9c8ea63ccec4ceb9dd40834bca530a6265385579
Author: Bryan Cutler <[email protected]>
Date: 2017-02-23T01:17:38Z
updated dependency manifest
commit b7c28ad19cc56553c8cdb53f98f92f189c9d7a27
Author: Bryan Cutler <[email protected]>
Date: 2017-02-24T00:29:30Z
test format fix for python 2.6
commit 2851cd6ea9e37a9fa439bc1a7510ba62ea90b110
Author: Bryan Cutler <[email protected]>
Date: 2017-02-28T22:03:57Z
fixed docstrings and added list of pyarrow supported types
commit f8f24abeb4fd3394a8aa4136b1dda6b6ae4cd323
Author: Bryan Cutler <[email protected]>
Date: 2017-03-03T22:38:03Z
fixed memory leak of ArrowRecordBatch iterator getting consumed and batches
not closed properly
commit b6c752b0fcc19f2c5795593d9b4a1511f574c83d
Author: Bryan Cutler <[email protected]>
Date: 2017-03-03T22:46:50Z
changed _collectAsArrow to private method
commit cbab294cbf291bd0082139e6ca988909b06bdd1a
Author: Bryan Cutler <[email protected]>
Date: 2017-03-03T23:07:03Z
added netty to exclusion list for arrow dependency
commit 44ca3ffc37711225a64fd57dca2ba747debf8c3d
Author: Bryan Cutler <[email protected]>
Date: 2017-03-07T18:21:46Z
dict comprehensions not supported in python 2.6
commit 33b75b99b375292e1f4691983e37919d0a620725
Author: Bryan Cutler <[email protected]>
Date: 2017-03-10T21:12:45Z
ensure payload batches are closed if any exception is thrown, some minor
cleanup
commit 97742b8bc39aa735af098bba16d5355651b87025
Author: Bryan Cutler <[email protected]>
Date: 2017-03-13T18:40:56Z
changed comment for readable seekable byte channel class
commit b821077b29daffffc5a931a89989fbd73fc90d68
Author: Li Jin <[email protected]>
Date: 2017-03-28T01:09:30Z
Remove Date and Timestamp from supported types
closes #24
commit 3d786a2e1b01a697678d1f4868b9e0354fe13333
Author: Bryan Cutler <[email protected]>
Date: 2017-04-03T21:22:44Z
Added scaladocs to methods that did not have it
commit cb4c510fc5c44ffe8bff0b4d57e0ca4e2af48a8a
Author: Bryan Cutler <[email protected]>
Date: 2017-04-03T22:01:19Z
added check for pyarrow import error
----