[GitHub] spark pull request #19575: [SPARK-22221][DOCS] Adding User Documentation for...

icexelloss Thu, 25 Jan 2018 12:53:12 -0800

Github user icexelloss commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19575#discussion_r163964479
  
    --- Diff: docs/sql-programming-guide.md ---
    @@ -1693,70 +1693,70 @@ Using the above optimizations with Arrow will 
produce the same results as when A
     enabled. Not all Spark data types are currently supported and an error 
will be raised if a column
     has an unsupported type, see [Supported Types](#supported-types).
     
    -## How to Write Vectorized UDFs
    +## Pandas UDFs (a.k.a Vectorized UDFs)
     
    -A vectorized UDF is similar to a standard UDF in Spark except the inputs 
and output will be
    -Pandas Series, which allow the function to be composed with vectorized 
operations. This function
    -can then be run very efficiently in Spark where data is sent in batches to 
Python and then
    -is executed using Pandas Series as the inputs. The exected output of the 
function is also a Pandas
    -Series of the same length as the inputs. A vectorized UDF is declared 
using the `pandas_udf`
    -keyword, no additional configuration is required.
    +With Arrow, we introduce a new type of UDF - pandas UDF. Pandas UDF is 
defined with a new function
    +`pyspark.sql.functions.pandas_udf` and allows user to use functions that 
operate on `pandas.Series`
    +and `pandas.DataFrame` with Spark. Currently, there are two types of 
pandas UDF: Scalar and Group Map.
     
    -The following example shows how to create a vectorized UDF that computes 
the product of 2 columns.
    +### Scalar
    +
    +Scalar pandas UDFs are used for vectorizing scalar operations. They can 
used with functions such as `select`
    +and `withColumn`. To define a scalar pandas UDF, use `pandas_udf` to 
annotate a Python function. The Python
    +should takes `pandas.Series` and returns a `pandas.Series` of the same 
size. Internally, Spark will
    +split a column into multiple `pandas.Series` and invoke the Python 
function with each `pandas.Series`, and
    +concat the results together to be a new column.
    +
    +The following example shows how to create a scalar pandas UDF that 
computes the product of 2 columns.
     
     <div class="codetabs">
     <div data-lang="python"  markdown="1">
     {% highlight python %}
     
     import pandas as pd
    -from pyspark.sql.functions import col, pandas_udf
    -from pyspark.sql.types import LongType
    +from pyspark.sql.functions import pandas_udf, PandasUDFTypr
    +
    +df = spark.createDataFrame(
    +    [(1,), (2,), (3,)],
    +    ['v'])
     
     # Declare the function and create the UDF
    -def multiply_func(a, b):
    +@pandas_udf('long', PandasUDFType.SCALAR)
    +def multiply_udf(a, b):
    +    # a and b are both pandas.Series
         return a * b
     
    -multiply = pandas_udf(multiply_func, returnType=LongType())
    -
    -# The function for a pandas_udf should be able to execute with local 
Pandas data
    -x = pd.Series([1, 2, 3])
    -print(multiply_func(x, x))
    -# 0    1
    -# 1    4
    -# 2    9
    -# dtype: int64
    -
    -# Create a Spark DataFrame, 'spark' is an existing SparkSession
    -df = spark.createDataFrame(pd.DataFrame(x, columns=["x"]))
    -
    -# Execute function as a Spark vectorized UDF
    -df.select(multiply(col("x"), col("x"))).show()
    -# +-------------------+
    -# |multiply_func(x, x)|
    -# +-------------------+
    -# |                  1|
    -# |                  4|
    -# |                  9|
    -# +-------------------+
    --- End diff --
    
    Oh, I wanted to keep the example simple. I can add them back if you think 
they are useful. Maybe as another section? It's cleaner to separate the code 
that is needed to use the API and code that explains what's going on.



---

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[GitHub] spark pull request #19575: [SPARK-22221][DOCS] Adding User Documentation for...

Reply via email to