zero323 commented on a change in pull request #27466:
[SPARK-30722][PYTHON][DOCS] Update documentation for Pandas UDF with Python type hints
URL: https://github.com/apache/spark/pull/27466#discussion_r375247139
##########
File path: docs/sql-pyspark-pandas-with-arrow.md
##########
@@ -65,132 +65,188 @@ Spark will fall back to create the DataFrame without Arrow.
## Pandas UDFs (a.k.a. Vectorized UDFs)
-Pandas UDFs are user defined functions that are executed by Spark using Arrow to transfer data and
-Pandas to work with the data. A Pandas UDF is defined using the keyword `pandas_udf` as a decorator
-or to wrap the function, no additional configuration is required. Currently, there are two types of
-Pandas UDF: Scalar and Grouped Map.
+Pandas UDFs are user defined functions that are executed by Spark using
+Arrow to transfer data and Pandas to work with the data, which allows vectorized operations. A Pandas
+UDF is defined using the `pandas_udf` as a decorator or to wrap the function, and no additional
+configuration is required. A Pandas UDF behaves as a regular PySpark function API in general.
-### Scalar
+Before Spark 3.0, Pandas UDFs used to be defined with `PandasUDFType`. From Spark 3.0
+with Python 3.6+, you can also use Python type hints. Using Python type hints are preferred and the
+previous way will be deprecated in the future release.
-Scalar Pandas UDFs are used for vectorizing scalar operations. They can be used with functions such
-as `select` and `withColumn`. The Python function should take `pandas.Series` as inputs and return
-a `pandas.Series` of the same length. Internally, Spark will execute a Pandas UDF by splitting
-columns into batches and calling the function for each batch as a subset of the data, then
-concatenating the results together.
+The below combinations of the type hints are supported by Python type hints for Pandas UDFs.
+Note that `pandas.DataFrame` is mapped to the column of `StructType`; otherwise, `pandas.Series` is
+mapped in all occurrences below.
-The following example shows how to create a scalar Pandas UDF that computes the product of 2 columns.
+### Series to Series
+
+The type hint can be expressed as `pandas.Series`, ... -> `pandas.Series`.
+
+By using `pandas_udf` with the function having such type hints, it creates a Pandas UDF where the given
+function takes one or more `pandas.Series` and outputs one `pandas.Series`. The output of the function should
+always be of the same length as the input. Internally, PySpark will execute a Pandas UDF by splitting
+columns into batches and calling the function for each batch as a subset of the data, then concatenating
+the results together.
+
+The following example shows how to create this Pandas UDF that computes the product of 2 columns.
<div class="codetabs">
<div data-lang="python" markdown="1">
-{% include_example scalar_pandas_udf python/sql/arrow.py %}
+{% include_example ser_to_ser_pandas_udf python/sql/arrow.py %}
</div>
</div>
-### Scalar Iterator
+For detailed usage, please see [`pyspark.sql.functions.pandas_udf`](api/python/pyspark.sql.html#pyspark.sql.functions.pandas_udf)
-Scalar iterator (`SCALAR_ITER`) Pandas UDF is the same as scalar Pandas UDF above except that the
-underlying Python function takes an iterator of batches as input instead of a single batch and,
-instead of returning a single output batch, it yields output batches or returns an iterator of
-output batches.
-It is useful when the UDF execution requires initializing some states, e.g., loading an machine
-learning model file to apply inference to every input batch.
+### Iterator of Series to Iterator of Series
-The following example shows how to create scalar iterator Pandas UDFs:
+The type hint can be expressed as `Iterator[pandas.Series]` -> `Iterator[pandas.Series]`.
Review comment:
Nitpick. It is more `Iterator[Union[Tuple[pandas.Series, ...], pandas.Series]]` -> `Iterator[pandas.Series]`, isn't it? But I guess that's too much...
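
To make the two input shapes in this nitpick concrete, here is a minimal sketch in plain pandas. The function names `plus_one` and `multiply` and the sample batches are hypothetical, and the `pandas_udf` decorator is deliberately left off so the snippet runs without a Spark session; it only illustrates the type hints, not how Spark feeds batches to the function.

```python
from typing import Iterator, Tuple

import pandas as pd


def plus_one(batches: Iterator[pd.Series]) -> Iterator[pd.Series]:
    # Single-column form: each item in the iterator is one pandas.Series batch.
    for s in batches:
        yield s + 1


def multiply(batches: Iterator[Tuple[pd.Series, pd.Series]]) -> Iterator[pd.Series]:
    # Multi-column form: each item is a tuple holding one pandas.Series per column.
    for a, b in batches:
        yield a * b


# Hand-made stand-ins for batches (assumption: two batches per call).
single = [pd.Series([1, 2]), pd.Series([3])]
paired = [
    (pd.Series([1, 2]), pd.Series([4, 5])),
    (pd.Series([3]), pd.Series([6])),
]

# Concatenate the yielded batches, loosely mirroring the "concatenating the
# results together" step described in the docs under review.
plus_one_out = pd.concat(list(plus_one(iter(single))), ignore_index=True)
multiply_out = pd.concat(list(multiply(iter(paired))), ignore_index=True)
```

Both functions return `Iterator[pandas.Series]`; only the element type of the input iterator differs, which is what the `Union[...]` in the comment is getting at.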