Github user icexelloss commented on a diff in the pull request: https://github.com/apache/spark/pull/19575#discussion_r163964479 --- Diff: docs/sql-programming-guide.md --- @@ -1693,70 +1693,70 @@ Using the above optimizations with Arrow will produce the same results as when A enabled. Not all Spark data types are currently supported and an error will be raised if a column has an unsupported type, see [Supported Types](#supported-types). -## How to Write Vectorized UDFs +## Pandas UDFs (a.k.a Vectorized UDFs) -A vectorized UDF is similar to a standard UDF in Spark except the inputs and output will be -Pandas Series, which allow the function to be composed with vectorized operations. This function -can then be run very efficiently in Spark where data is sent in batches to Python and then -is executed using Pandas Series as the inputs. The exected output of the function is also a Pandas -Series of the same length as the inputs. A vectorized UDF is declared using the `pandas_udf` -keyword, no additional configuration is required. +With Arrow, we introduce a new type of UDF - pandas UDF. Pandas UDF is defined with a new function +`pyspark.sql.functions.pandas_udf` and allows user to use functions that operate on `pandas.Series` +and `pandas.DataFrame` with Spark. Currently, there are two types of pandas UDF: Scalar and Group Map. -The following example shows how to create a vectorized UDF that computes the product of 2 columns. +### Scalar + +Scalar pandas UDFs are used for vectorizing scalar operations. They can used with functions such as `select` +and `withColumn`. To define a scalar pandas UDF, use `pandas_udf` to annotate a Python function. The Python +should takes `pandas.Series` and returns a `pandas.Series` of the same size. Internally, Spark will +split a column into multiple `pandas.Series` and invoke the Python function with each `pandas.Series`, and +concat the results together to be a new column. + +The following example shows how to create a scalar pandas UDF that computes the product of 2 columns. <div class="codetabs"> <div data-lang="python" markdown="1"> {% highlight python %} import pandas as pd -from pyspark.sql.functions import col, pandas_udf -from pyspark.sql.types import LongType +from pyspark.sql.functions import pandas_udf, PandasUDFTypr + +df = spark.createDataFrame( + [(1,), (2,), (3,)], + ['v']) # Declare the function and create the UDF -def multiply_func(a, b): +@pandas_udf('long', PandasUDFType.SCALAR) +def multiply_udf(a, b): + # a and b are both pandas.Series return a * b -multiply = pandas_udf(multiply_func, returnType=LongType()) - -# The function for a pandas_udf should be able to execute with local Pandas data -x = pd.Series([1, 2, 3]) -print(multiply_func(x, x)) -# 0 1 -# 1 4 -# 2 9 -# dtype: int64 - -# Create a Spark DataFrame, 'spark' is an existing SparkSession -df = spark.createDataFrame(pd.DataFrame(x, columns=["x"])) - -# Execute function as a Spark vectorized UDF -df.select(multiply(col("x"), col("x"))).show() -# +-------------------+ -# |multiply_func(x, x)| -# +-------------------+ -# | 1| -# | 4| -# | 9| -# +-------------------+ --- End diff -- Oh, I wanted to keep the example simple. I can add them back if you think they are useful. Maybe as another section? It's cleaner to separate the code that is needed to use the API and code that explains what's going on.
--- --------------------------------------------------------------------- To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org