Github user gatorsmile commented on a diff in the pull request:
https://github.com/apache/spark/pull/19505#discussion_r144917109
--- Diff: python/pyspark/sql/functions.py ---
@@ -2192,67 +2205,82 @@ def pandas_udf(f=None, returnType=StringType()):
:param f: user-defined function. A python function if used as a
standalone function
:param returnType: a :class:`pyspark.sql.types.DataType` object
- The user-defined function can define one of the following
transformations:
-
- 1. One or more `pandas.Series` -> A `pandas.Series`
-
- This udf is used with :meth:`pyspark.sql.DataFrame.withColumn` and
- :meth:`pyspark.sql.DataFrame.select`.
- The returnType should be a primitive data type, e.g.,
`DoubleType()`.
- The length of the returned `pandas.Series` must be of the same as
the input `pandas.Series`.
-
- >>> from pyspark.sql.types import IntegerType, StringType
- >>> slen = pandas_udf(lambda s: s.str.len(), IntegerType())
- >>> @pandas_udf(returnType=StringType())
- ... def to_upper(s):
- ... return s.str.upper()
- ...
- >>> @pandas_udf(returnType="integer")
- ... def add_one(x):
- ... return x + 1
- ...
- >>> df = spark.createDataFrame([(1, "John Doe", 21)], ("id",
"name", "age"))
- >>> df.select(slen("name").alias("slen(name)"), to_upper("name"),
add_one("age")) \\
- ... .show() # doctest: +SKIP
- +----------+--------------+------------+
- |slen(name)|to_upper(name)|add_one(age)|
- +----------+--------------+------------+
- | 8| JOHN DOE| 22|
- +----------+--------------+------------+
-
- 2. A `pandas.DataFrame` -> A `pandas.DataFrame`
-
- This udf is only used with :meth:`pyspark.sql.GroupedData.apply`.
- The returnType should be a :class:`StructType` describing the
schema of the returned
- `pandas.DataFrame`.
-
- >>> df = spark.createDataFrame(
- ... [(1, 1.0), (1, 2.0), (2, 3.0), (2, 5.0), (2, 10.0)],
- ... ("id", "v"))
- >>> @pandas_udf(returnType=df.schema)
- ... def normalize(pdf):
- ... v = pdf.v
- ... return pdf.assign(v=(v - v.mean()) / v.std())
- >>> df.groupby('id').apply(normalize).show() # doctest: +SKIP
- +---+-------------------+
- | id| v|
- +---+-------------------+
- | 1|-0.7071067811865475|
- | 1| 0.7071067811865475|
- | 2|-0.8320502943378437|
- | 2|-0.2773500981126146|
- | 2| 1.1094003924504583|
- +---+-------------------+
-
- .. note:: This type of udf cannot be used with functions such as
`withColumn` or `select`
- because it defines a `DataFrame` transformation rather
than a `Column`
- transformation.
-
- .. seealso:: :meth:`pyspark.sql.GroupedData.apply`
+ The user-defined function can define the following transformation:
+
+ One or more `pandas.Series` -> A `pandas.Series`
+
+ This udf is used with :meth:`pyspark.sql.DataFrame.withColumn` and
+ :meth:`pyspark.sql.DataFrame.select`.
+ The returnType should be a primitive data type, e.g., `DoubleType()`.
--- End diff --
What happened if we do not pass a primitive data type? Do we have a test
case for this?
---
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]