Github user icexelloss commented on a diff in the pull request:
https://github.com/apache/spark/pull/19575#discussion_r163964598
--- Diff: docs/sql-programming-guide.md ---
@@ -1693,70 +1693,70 @@ Using the above optimizations with Arrow will
produce the same results as when A
enabled. Not all Spark data types are currently supported and an error
will be raised if a column
has an unsupported type, see [Supported Types](#supported-types).
-## How to Write Vectorized UDFs
+## Pandas UDFs (a.k.a Vectorized UDFs)
-A vectorized UDF is similar to a standard UDF in Spark except the inputs
and output will be
-Pandas Series, which allow the function to be composed with vectorized
operations. This function
-can then be run very efficiently in Spark where data is sent in batches to
Python and then
-is executed using Pandas Series as the inputs. The exected output of the
function is also a Pandas
-Series of the same length as the inputs. A vectorized UDF is declared
using the `pandas_udf`
-keyword, no additional configuration is required.
+With Arrow, we introduce a new type of UDF - pandas UDF. Pandas UDF is
defined with a new function
+`pyspark.sql.functions.pandas_udf` and allows user to use functions that
operate on `pandas.Series`
+and `pandas.DataFrame` with Spark. Currently, there are two types of
pandas UDF: Scalar and Group Map.
-The following example shows how to create a vectorized UDF that computes
the product of 2 columns.
+### Scalar
+
+Scalar pandas UDFs are used for vectorizing scalar operations. They can
used with functions such as `select`
+and `withColumn`. To define a scalar pandas UDF, use `pandas_udf` to
annotate a Python function. The Python
+should takes `pandas.Series` and returns a `pandas.Series` of the same
size. Internally, Spark will
+split a column into multiple `pandas.Series` and invoke the Python
function with each `pandas.Series`, and
+concat the results together to be a new column.
+
+The following example shows how to create a scalar pandas UDF that
computes the product of 2 columns.
<div class="codetabs">
<div data-lang="python" markdown="1">
{% highlight python %}
import pandas as pd
-from pyspark.sql.functions import col, pandas_udf
-from pyspark.sql.types import LongType
+from pyspark.sql.functions import pandas_udf, PandasUDFTypr
+
+df = spark.createDataFrame(
+ [(1,), (2,), (3,)],
+ ['v'])
# Declare the function and create the UDF
-def multiply_func(a, b):
+@pandas_udf('long', PandasUDFType.SCALAR)
+def multiply_udf(a, b):
+ # a and b are both pandas.Series
return a * b
-multiply = pandas_udf(multiply_func, returnType=LongType())
-
-# The function for a pandas_udf should be able to execute with local
Pandas data
-x = pd.Series([1, 2, 3])
-print(multiply_func(x, x))
-# 0 1
-# 1 4
-# 2 9
-# dtype: int64
-
-# Create a Spark DataFrame, 'spark' is an existing SparkSession
-df = spark.createDataFrame(pd.DataFrame(x, columns=["x"]))
-
-# Execute function as a Spark vectorized UDF
-df.select(multiply(col("x"), col("x"))).show()
-# +-------------------+
-# |multiply_func(x, x)|
-# +-------------------+
-# | 1|
-# | 4|
-# | 9|
-# +-------------------+
+df.select(multiply_udf(df.v, df.v)).show()
+# +------------------+
+# |multiply_udf(v, v)|
+# +------------------+
+# | 1|
+# | 4|
+# | 9|
+# +------------------+
{% endhighlight %}
</div>
</div>
-## GroupBy-Apply
-GroupBy-Apply implements the "split-apply-combine" pattern.
Split-apply-combine consists of three steps:
+Note that there are two important requirement when using scalar pandas
UDFs:
+* The input and output series must have the same size.
+* How a column is splitted into multiple `pandas.Series` is internal to
Spark, and therefore the result
+ of user-defined function must be independent of the splitting.
--- End diff --
Yeah I don't know a great way of saying this. Do you have any suggestions?
---
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]