Github user BryanCutler commented on a diff in the pull request:
https://github.com/apache/spark/pull/21471#discussion_r192245555
--- Diff: docs/sql-programming-guide.md ---
@@ -1752,6 +1752,15 @@ To use `groupBy().apply()`, the user needs to define
the following:
* A Python function that defines the computation for each group.
* A `StructType` object or a string that defines the schema of the output
`DataFrame`.
+The output schema will be applied to the columns of the returned `pandas.DataFrame`
+in order by position, not by name. This means that the columns in the
+`pandas.DataFrame` must be indexed so that their position matches the
+corresponding field in the schema.
+
+Note that when creating a new `pandas.DataFrame` using a dictionary, the actual
+position of a column can differ from the order in which it was placed in the
+dictionary. In this case it is recommended to explicitly define the column order
+using the `columns` keyword, e.g.
+`pandas.DataFrame({'id': ids, 'a': data}, columns=['id', 'a'])`.
--- End diff --
It's not ideal to list the column names, but I think it makes it clear that
you can't rely on the dictionary order. I'll add `OrderedDict` there too, to show
there is more than one way.
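To illustrate the doc text above, here is a minimal pandas-only sketch (no Spark needed); the variable names `ids` and `data` are just placeholders:

```python
import pandas as pd
from collections import OrderedDict

ids = [1, 2]
data = [0.1, 0.2]

# Without `columns`, column order follows the dict's insertion order, which
# historically (Python < 3.7 and older pandas) was not guaranteed -- so the
# positional match against the Spark output schema could silently break.
df_implicit = pd.DataFrame({'id': ids, 'a': data})

# Explicit `columns` pins the positional order the output schema expects.
df_explicit = pd.DataFrame({'id': ids, 'a': data}, columns=['id', 'a'])

# An OrderedDict is another way to make the ordering explicit.
df_ordered = pd.DataFrame(OrderedDict([('id', ids), ('a', data)]))

print(list(df_explicit.columns))
print(list(df_ordered.columns))
```

Either form guarantees that column 0 is `id` and column 1 is `a`, matching the schema by position.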
---