GitHub user BryanCutler opened a pull request:
https://github.com/apache/spark/pull/20114
[SPARK-22530][PYTHON][SQL] Adding Arrow support for ArrayType
## What changes were proposed in this pull request?
This change adds `ArrayType` support for working with Arrow in PySpark when
creating a DataFrame, calling `toPandas()`, and using a vectorized `pandas_udf`.
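
For context, Arrow stores a list column as a flat child values array plus an offsets buffer and a validity buffer, and an `ArrayType` column has to be mapped onto that layout when it crosses the Arrow boundary. The following is a minimal pure-Python sketch of that layout, not the code in this patch: the helper names `to_arrow_list_layout` and `from_arrow_list_layout` are hypothetical, and real Arrow uses a packed validity bitmap and 32-bit offsets rather than Python lists.

```python
from typing import List, Optional, Tuple

Column = List[Optional[List[int]]]


def to_arrow_list_layout(column: Column) -> Tuple[List[bool], List[int], List[int]]:
    """Flatten a column of optional lists into (validity, offsets, values).

    Hypothetical helper for illustration only. Slot i of the column is
    values[offsets[i]:offsets[i + 1]] when validity[i] is True, else null.
    """
    validity: List[bool] = []
    offsets: List[int] = [0]   # always one more entry than there are slots
    values: List[int] = []
    for item in column:
        if item is None:
            validity.append(False)   # null slot contributes no child values
        else:
            validity.append(True)
            values.extend(item)
        offsets.append(len(values))  # end offset of this slot
    return validity, offsets, values


def from_arrow_list_layout(validity: List[bool],
                           offsets: List[int],
                           values: List[int]) -> Column:
    """Rebuild the column of optional lists from the flattened layout."""
    column: Column = []
    for i, valid in enumerate(validity):
        column.append(values[offsets[i]:offsets[i + 1]] if valid else None)
    return column


# A column with a regular list, a null, an empty list, and a single-element list.
column = [[1, 2], None, [], [3]]
validity, offsets, values = to_arrow_list_layout(column)
roundtrip = from_arrow_list_layout(validity, offsets, values)
```

Note that the null slot and the empty list produce the same offsets; only the validity buffer distinguishes them, which is why that buffer has to be populated correctly for list data.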
## How was this patch tested?
Added new Python unit tests using Array data.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/BryanCutler/spark arrow-ArrayType-support-SPARK-22530
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/20114.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #20114
----
commit 50fa54c5b04455729b019c660ab8e86c903bda44
Author: Bryan Cutler <cutlerb@...>
Date: 2017-11-15T23:44:23Z
wip, toPandas works with pyarrow 0.7.1
commit a149352d0c60882bb6692cd43d2fb60c8dddb07b
Author: Bryan Cutler <cutlerb@...>
Date: 2017-12-01T20:02:16Z
createDataFrame test now working
commit 36faab4d7a23421968e1885dc6f2f47ac20c0ce0
Author: Bryan Cutler <cutlerb@...>
Date: 2017-12-23T08:21:34Z
using is_list to check type
commit b0c79f108acf3ca91dd931bb9be45e4bbcf840a6
Author: Bryan Cutler <cutlerb@...>
Date: 2017-12-24T07:06:06Z
Using a workaround for ListVector validity buffer, ArrowTests passing
commit f1bc9a5d8ba09cf6d702269b2418697184ef5690
Author: Bryan Cutler <cutlerb@...>
Date: 2017-12-29T05:54:44Z
ArrayType working in vectorized udfs
commit d2c5c2b4ea803ac8d1f08a5f79af1076f9e5bd2b
Author: Bryan Cutler <cutlerb@...>
Date: 2017-12-29T06:04:19Z
fix import order
----