Github user HyukjinKwon commented on a diff in the pull request:
https://github.com/apache/spark/pull/20295#discussion_r171285307
--- Diff: python/pyspark/sql/functions.py ---
@@ -2253,6 +2253,30 @@ def pandas_udf(f=None, returnType=None, functionType=None):
| 2| 1.1094003924504583|
+---+-------------------+
+ Alternatively, the user can define a function that takes two arguments.
+ In this case, the grouping key will be passed as the first argument and the data will
+ be passed as the second argument. The grouping key will be passed as a tuple of numpy
+ data types, e.g., `numpy.int32` and `numpy.float64`. The data will still be passed in
+ as a `pandas.DataFrame` containing all columns from the original Spark DataFrame.
+ This is useful when the user doesn't want to hardcode grouping key in the function.
--- End diff --
I usually avoid abbreviations like `doesn't` in docs, but I am not sure if this
actually matters.
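For context, the two-argument behaviour described in the diff can be sketched in plain pandas, without Spark. This is only an illustrative simulation of how Spark calls the user's function per group; the helper names `simulate_grouped_apply` and `subtract_mean` are hypothetical and not part of the PySpark API:

```python
import pandas as pd

def subtract_mean(key, pdf):
    # key is the grouping key as a tuple, e.g. (1,) for id == 1;
    # pdf is a pandas.DataFrame holding that group's rows.
    return pdf.assign(v=pdf.v - pdf.v.mean())

def simulate_grouped_apply(df, group_col, func):
    # Mimic grouped-map semantics: call func(key_tuple, group_frame)
    # once per group and concatenate the results.
    pieces = []
    for key, pdf in df.groupby(group_col):
        k = key if isinstance(key, tuple) else (key,)
        pieces.append(func(k, pdf))
    return pd.concat(pieces, ignore_index=True)

df = pd.DataFrame({"id": [1, 1, 2, 2], "v": [1.0, 2.0, 3.0, 5.0]})
result = simulate_grouped_apply(df, "id", subtract_mean)
# result["v"] is [-0.5, 0.5, -1.0, 1.0]
```

In actual PySpark the same function body would be registered as a grouped-map `pandas_udf` and applied via `groupby(...).apply(...)`, with Spark supplying the key tuple and the per-group `pandas.DataFrame`.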
---