Github user BryanCutler commented on a diff in the pull request:
https://github.com/apache/spark/pull/21471#discussion_r192245555
--- Diff: docs/sql-programming-guide.md ---
@@ -1752,6 +1752,15 @@ To use `groupBy().apply()`, the user needs to define
the following:
* A Python function that defines the computation for each group.
* A `StructType` object or a string that defines the schema of the output
`DataFrame`.
+The output schema will be applied to the columns of the returned `pandas.DataFrame`
+in order by position, not by name. This means that the columns in the
+`pandas.DataFrame` must be indexed so that their position matches the
+corresponding field in the schema.
+
+Note that when creating a new `pandas.DataFrame` using a dictionary, the actual
+position of a column can differ from the order in which it was placed in the
+dictionary. In this case it is recommended to explicitly define the column order
+using the `columns` keyword, e.g.
+`pandas.DataFrame({'id': ids, 'a': data}, columns=['id', 'a'])`.
--- End diff --
It's not ideal to list the column names, but I think it makes it clear that
you can't rely on the dictionary order. I'll add `OrderedDict` there too, to show
there is more than one way.
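To illustrate the doc text above, here is a minimal pandas-only sketch (no Spark needed); the variable names `ids` and `data` are just placeholders:

```python
import pandas as pd
from collections import OrderedDict

ids = [1, 2]
data = [0.1, 0.2]

# Without `columns`, column order follows the dict's insertion order, which
# historically (Python < 3.7 and older pandas) was not guaranteed -- so the
# positional match against the Spark output schema could silently break.
df_implicit = pd.DataFrame({'id': ids, 'a': data})

# Explicit `columns` pins the positional order the output schema expects.
df_explicit = pd.DataFrame({'id': ids, 'a': data}, columns=['id', 'a'])

# An OrderedDict is another way to make the ordering explicit.
df_ordered = pd.DataFrame(OrderedDict([('id', ids), ('a', data)]))

print(list(df_explicit.columns))
print(list(df_ordered.columns))
```

Either form guarantees that column 0 is `id` and column 1 is `a`, matching the schema by position.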
---