GitHub user BryanCutler opened a pull request:
https://github.com/apache/spark/pull/21427
[SPARK-24324][PYTHON] Pandas Grouped Map UDF should assign result columns by name
## What changes were proposed in this pull request?
Currently, a `pandas_udf` of type `PandasUDFType.GROUPED_MAP` assigns the
resulting columns by position in the returned pandas.DataFrame. If the UDF
returns a new DataFrame constructed from a dict, the column order can be
arbitrary and may differ from the schema defined for the UDF. If the column
types still happen to match the schema, no error is raised and the user sees
column names paired with the wrong column data. This change assigns the
result columns by name instead of by position.
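
The mix-up can be illustrated without Spark. The following is a minimal sketch, not the code in this patch: a plain dict of lists stands in for the pandas.DataFrame returned by the UDF, and `schema_fields`, `assign_by_position`, and `assign_by_name` are hypothetical names chosen for the illustration. Both columns hold integers, so a type check alone would not catch the swap.

```python
# Field names in the order declared by the UDF's return schema (assumed).
schema_fields = ["id", "v"]

# Stand-in for the returned DataFrame: column name -> values.
# The columns are in a different order than the schema.
result = {"v": [10, 20], "id": [1, 2]}


def assign_by_position(result, fields):
    """Pair the i-th schema field with the i-th result column."""
    columns = list(result.values())
    return {name: columns[i] for i, name in enumerate(fields)}


def assign_by_name(result, fields):
    """Look up each schema field by its name in the result."""
    return {name: result[name] for name in fields}


by_position = assign_by_position(result, schema_fields)
by_name = assign_by_name(result, schema_fields)

print(by_position)  # "id" receives the values of "v" and vice versa
print(by_name)      # each field receives its own values
```

Positional assignment silently swaps the data because nothing ties a column's values to its label; lookup by name does not depend on the order in which the result was built.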
## How was this patch tested?
Added a test that returns a new DataFrame whose column order differs from
the schema.
You can merge this pull request into a Git repository by running:
    $ git pull https://github.com/BryanCutler/spark arrow-grouped-map-mixesup-cols-SPARK-24324
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/21427.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #21427
----
commit 0641c5a0cd690fec905829b70006de3f8a4902fc
Author: Bryan Cutler <cutlerb@...>
Date:   2018-05-24T19:03:02Z

    added test for diff column order

commit 8484647113144958c8ebcf3611c222119047cc96
Author: Bryan Cutler <cutlerb@...>
Date:   2018-05-24T19:18:47Z

    needed to adjust expected values to compare results

commit d67a8a5987d6ba4bdd65f5d5decafca2d22291ad
Author: Bryan Cutler <cutlerb@...>
Date:   2018-05-24T19:21:36Z

    for grouped map results, get columns based on name instead of position

----