HyukjinKwon commented on a change in pull request #27466: 
[SPARK-30722][PYTHON][DOCS] Update documentation for Pandas UDF with Python 
type hints
URL: https://github.com/apache/spark/pull/27466#discussion_r375216428
 
 

 ##########
 File path: python/pyspark/sql/pandas/functions.py
 ##########
 @@ -43,303 +43,186 @@ class PandasUDFType(object):
 @since(2.3)
 def pandas_udf(f=None, returnType=None, functionType=None):
     """
-    Creates a vectorized user defined function (UDF).
+    Creates a pandas user defined function (a.k.a. vectorized user defined function).
+
+    Pandas UDFs are user defined functions that are executed by Spark using Arrow to transfer
+    data and Pandas to work with the data, which allows vectorized operations. A Pandas UDF
+    is defined using the `pandas_udf` as a decorator or to wrap the function, and no
+    additional configuration is required. A Pandas UDF behaves as a regular PySpark function
+    API in general.
 
     :param f: user-defined function. A python function if used as a standalone function
     :param returnType: the return type of the user-defined function. The value can be either a
         :class:`pyspark.sql.types.DataType` object or a DDL-formatted type string.
     :param functionType: an enum value in :class:`pyspark.sql.functions.PandasUDFType`.
-                         Default: SCALAR.
-
-    .. seealso:: :meth:`pyspark.sql.DataFrame.mapInPandas`
-    .. seealso:: :meth:`pyspark.sql.GroupedData.applyInPandas`
-    .. seealso:: :meth:`pyspark.sql.PandasCogroupedOps.applyInPandas`
-
-    The function type of the UDF can be one of the following:
-
-    1. SCALAR
-
 -       A scalar UDF defines a transformation: One or more `pandas.Series` -> A `pandas.Series`.
 -       The length of the returned `pandas.Series` must be of the same as the input `pandas.Series`.
 -       If the return type is :class:`StructType`, the returned value should be a `pandas.DataFrame`.
 -
 -       :class:`MapType`, nested :class:`StructType` are currently not supported as output types.
 -
 -       Scalar UDFs can be used with :meth:`pyspark.sql.DataFrame.withColumn` and
 -       :meth:`pyspark.sql.DataFrame.select`.
-
-       >>> from pyspark.sql.functions import pandas_udf, PandasUDFType
-       >>> from pyspark.sql.types import IntegerType, StringType
 -       >>> slen = pandas_udf(lambda s: s.str.len(), IntegerType())  # doctest: +SKIP
-       >>> @pandas_udf(StringType())  # doctest: +SKIP
-       ... def to_upper(s):
-       ...     return s.str.upper()
-       ...
-       >>> @pandas_udf("integer", PandasUDFType.SCALAR)  # doctest: +SKIP
-       ... def add_one(x):
-       ...     return x + 1
-       ...
-       >>> df = spark.createDataFrame([(1, "John Doe", 21)],
-       ...                            ("id", "name", "age"))  # doctest: +SKIP
 -       >>> df.select(slen("name").alias("slen(name)"), to_upper("name"), add_one("age")) \\
-       ...     .show()  # doctest: +SKIP
-       +----------+--------------+------------+
-       |slen(name)|to_upper(name)|add_one(age)|
-       +----------+--------------+------------+
-       |         8|      JOHN DOE|          22|
-       +----------+--------------+------------+
-       >>> @pandas_udf("first string, last string")  # doctest: +SKIP
-       ... def split_expand(n):
-       ...     return n.str.split(expand=True)
-       >>> df.select(split_expand("name")).show()  # doctest: +SKIP
-       +------------------+
-       |split_expand(name)|
-       +------------------+
-       |       [John, Doe]|
-       +------------------+
-
 -       .. note:: The length of `pandas.Series` within a scalar UDF is not that of the whole input
 -           column, but is the length of an internal batch used for each call to the function.
 -           Therefore, this can be used, for example, to ensure the length of each returned
 
 Review comment:
   I removed this example. This is already implied, since the docstring says the length of the input can be arbitrary and the lengths of the input and output should be the same.
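
For reference, a minimal sketch of the scalar ("Series -> Series") contract the removed note described, using the type-hint style this PR documents. The function name and data below are illustrative, not from the PR; wrapping it with `pandas_udf` assumes PySpark 3.0+ with pyarrow available.

```python
import pandas as pd

# Per-batch logic of a scalar pandas UDF: Spark splits a column into internal
# Arrow batches and calls the function once per batch, so the input Series is
# a batch (not the whole column) and the output must have the same length.
def add_one_batch(s: pd.Series) -> pd.Series:
    return s + 1

# With a SparkSession available, this would be declared as a pandas UDF in
# the new type-hint style (hypothetical usage, not executed here):
#
#   from pyspark.sql.functions import pandas_udf
#
#   @pandas_udf("long")
#   def add_one(s: pd.Series) -> pd.Series:
#       return add_one_batch(s)
```

The type hints (`pd.Series -> pd.Series`) replace the `PandasUDFType.SCALAR` enum shown in the removed examples.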

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[email protected]


With regards,
Apache Git Services
