GitHub user BryanCutler opened a pull request:

    https://github.com/apache/spark/pull/21427

    [SPARK-24324][PYTHON] Pandas Grouped Map UDF should assign result columns by name

    ## What changes were proposed in this pull request?
    
    Currently, a `pandas_udf` of type `PandasUDFType.GROUPED_MAP` assigns the
    resulting columns by position in the returned `pandas.DataFrame`. If a new
    DataFrame is constructed from a dict and returned, the column order can be
    arbitrary and differ from the schema defined for the UDF. If the column
    types still match, no error is raised and the user sees column names
    paired with the wrong column data. This change assigns result columns by
    name instead of by position.
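
As a rough illustration of the mix-up described above, the following sketch models the returned DataFrame as an ordered mapping of column name to values. It is a simplified stand-in, not the Spark or pandas code touched by this PR; the column names `v1` and `v2` and both helper functions are hypothetical.

```python
from collections import OrderedDict

# Column names in the order declared by the UDF's schema (hypothetical names).
schema_names = ["v1", "v2"]

# Stand-in for the returned pandas.DataFrame: same columns, same types,
# but built in a different order than the schema.
result = OrderedDict([("v2", [0.5, 1.5]), ("v1", [10.0, 20.0])])


def assign_by_position(result, schema_names):
    """Pair the i-th returned column with the i-th schema name."""
    return OrderedDict(zip(schema_names, result.values()))


def assign_by_name(result, schema_names):
    """Look up each schema name among the returned columns."""
    return OrderedDict((name, result[name]) for name in schema_names)


by_position = assign_by_position(result, schema_names)
by_name = assign_by_name(result, schema_names)

# Positional assignment silently labels the v2 data as v1.
print("by position:", dict(by_position))
# Name-based assignment keeps each label with its own data.
print("by name:    ", dict(by_name))
```

Because both columns hold the same type, the positional variant raises no error; the labels simply end up on the wrong data, which matches the symptom reported here.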
    
    ## How was this patch tested?
    
    Added a test that returns a new DataFrame whose column order differs from
    the schema.


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/BryanCutler/spark arrow-grouped-map-mixesup-cols-SPARK-24324

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/21427.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #21427
    
----
commit 0641c5a0cd690fec905829b70006de3f8a4902fc
Author: Bryan Cutler <cutlerb@...>
Date:   2018-05-24T19:03:02Z

    added test for diff column order

commit 8484647113144958c8ebcf3611c222119047cc96
Author: Bryan Cutler <cutlerb@...>
Date:   2018-05-24T19:18:47Z

    needed to adjust expected values to compare results

commit d67a8a5987d6ba4bdd65f5d5decafca2d22291ad
Author: Bryan Cutler <cutlerb@...>
Date:   2018-05-24T19:21:36Z

    for grouped map results, get columns based on name instead of position

----


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org
