GitHub user BryanCutler opened a pull request: https://github.com/apache/spark/pull/21427
[SPARK-24324][PYTHON] Pandas Grouped Map UDF should assign result columns by name

## What changes were proposed in this pull request?

Currently, a `pandas_udf` of type `PandasUDFType.GROUPED_MAP` assigns the resulting columns based on their index in the returned `pandas.DataFrame`. If a new DataFrame is constructed from a dict and returned, the order of its columns can be arbitrary and can differ from the schema defined for the UDF. If the schema types still match, no error is raised and the user sees column names and column data mixed up.

## How was this patch tested?

Added a test that returns a new DataFrame whose column order differs from the schema.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/BryanCutler/spark arrow-grouped-map-mixesup-cols-SPARK-24324

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/21427.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #21427

----
commit 0641c5a0cd690fec905829b70006de3f8a4902fc
Author: Bryan Cutler <cutlerb@...>
Date:   2018-05-24T19:03:02Z

    added test for diff column order

commit 8484647113144958c8ebcf3611c222119047cc96
Author: Bryan Cutler <cutlerb@...>
Date:   2018-05-24T19:18:47Z

    needed to adjust expected values to compare results

commit d67a8a5987d6ba4bdd65f5d5decafca2d22291ad
Author: Bryan Cutler <cutlerb@...>
Date:   2018-05-24T19:21:36Z

    for grouped map results, get columns based on name instead of position

----
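The column mix-up described in the pull request can be illustrated with a small standard-library model. This is only a sketch and not Spark or pandas code: the "frame" is a plain list of `(column_name, values)` pairs standing in for the returned `pandas.DataFrame`, and the schema, the column names `id` and `v`, and both helper functions are hypothetical names chosen for the illustration.

```python
# Simplified model, NOT Spark code: a "frame" is a list of (column_name, values)
# pairs standing in for the pandas.DataFrame returned by a grouped map UDF.

# Hypothetical UDF schema: column names in their declared order.
schema = ["id", "v"]

# The UDF built its result with the columns in a different order than the schema.
# Both columns hold integers, so a type check alone would not catch the swap.
result = [("v", [1, 2]), ("id", [10, 20])]


def assign_by_position(frame, schema):
    """Pair the i-th schema name with the i-th result column (positional assignment)."""
    return {name: values for name, (_, values) in zip(schema, frame)}


def assign_by_name(frame, schema):
    """Look up each schema name in the result (name-based assignment)."""
    columns = dict(frame)
    return {name: columns[name] for name in schema}


by_position = assign_by_position(result, schema)
by_name = assign_by_name(result, schema)

print(by_position)  # the "id" column ends up holding the data of "v", and vice versa
print(by_name)      # each schema column holds its own data
```

In this model, positional assignment silently swaps the data of the two columns because their types match, while the name-based lookup returns each column's own data regardless of the order in which the result was built.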