Github user BryanCutler commented on a diff in the pull request:
https://github.com/apache/spark/pull/18732#discussion_r143800589
--- Diff: python/pyspark/sql/group.py ---
@@ -192,7 +193,84 @@ def pivot(self, pivot_col, values=None):
jgd = self._jgd.pivot(pivot_col)
else:
jgd = self._jgd.pivot(pivot_col, values)
- return GroupedData(jgd, self.sql_ctx)
+ return GroupedData(jgd, self._df)
+
+ @since(2.3)
+ def apply(self, udf):
+ """
+ Maps each group of the current :class:`DataFrame` using a pandas udf and returns
+ the result as a `DataFrame`.
+
+ The user-defined function should take a `pandas.DataFrame` and return another
+ `pandas.DataFrame`. For each group, all columns are passed together as a
+ `pandas.DataFrame` to the user-defined function, and the returned
+ `pandas.DataFrame`\ s are combined into a `DataFrame`.
+ The returned `pandas.DataFrame` can be of arbitrary length, and its schema
+ must match the returnType of the pandas udf.
+
+ This function does not support partial aggregation, and requires shuffling
+ all the data in the `DataFrame`.
+
+ :param udf: A wrapped udf function returned by :meth:`pyspark.sql.functions.pandas_udf`
--- End diff --
I think "A wrapped udf" might be confusing to the user; how about just
saying "A `pandas_udf` returned by..."?
---
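For context, the grouped `apply` described in the diff follows the same split-apply-combine shape as pandas' own `groupby().apply()`: the user function receives one `pandas.DataFrame` per group and returns a `pandas.DataFrame`. A minimal pandas-only sketch of that contract (hypothetical data, not a PySpark example):

```python
import pandas as pd

# Hypothetical data standing in for the rows of a grouped DataFrame.
df = pd.DataFrame({"id": [1, 1, 2, 2], "v": [1.0, 2.0, 3.0, 4.0]})

# The user function takes the pandas.DataFrame for one group and returns
# another pandas.DataFrame; here it centers "v" within each group.
def subtract_mean(pdf):
    return pdf.assign(v=pdf["v"] - pdf["v"].mean())

# Each group's result is combined back into a single DataFrame,
# mirroring how the per-group outputs are concatenated in Spark.
result = df.groupby("id", group_keys=False).apply(subtract_mean)
```

In the PySpark version under review, `subtract_mean` would instead be wrapped with `pandas_udf` (carrying the output schema as its returnType) and passed to `GroupedData.apply`.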
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]