[GitHub] spark pull request #19575: [SPARK-22221][DOCS] Adding User Documentation for...

BryanCutler Thu, 25 Jan 2018 12:35:22 -0800

Github user BryanCutler commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19575#discussion_r163960394
  
    --- Diff: docs/sql-programming-guide.md ---
    @@ -1640,6 +1640,250 @@ Configuration of Hive is done by placing your 
`hive-site.xml`, `core-site.xml` a
     You may run `./bin/spark-sql --help` for a complete list of all available
     options.
     
    +# PySpark Usage Guide for Pandas with Arrow
    +
    +## Arrow in Spark
    +
    +Apache Arrow is an in-memory columnar data format that is used in Spark to 
efficiently transfer
    +data between JVM and Python processes. This currently is most beneficial 
to Python users that
    +work with Pandas/NumPy data. Its usage is not automatic and might require 
some minor
    +changes to configuration or code to take full advantage and ensure 
compatibility. This guide will
    +give a high-level description of how to use Arrow in Spark and highlight 
any differences when
    +working with Arrow-enabled data.
    +
    +### Ensure PyArrow Installed
    +
    +If you install PySpark using pip, then PyArrow can be brought in as an 
extra dependency of the
    +SQL module with the command `pip install pyspark[sql]`. Otherwise, you 
must ensure that PyArrow
    +is installed and available on all cluster nodes. The current supported 
version is 0.8.0.
    +You can install using pip or conda from the conda-forge channel. See 
PyArrow
    +[installation](https://arrow.apache.org/docs/python/install.html) for 
details.
    +
    +## Enabling for Conversion to/from Pandas
    +
    +Arrow is available as an optimization when converting a Spark DataFrame to 
Pandas using the call
    +`toPandas()` and when creating a Spark DataFrame from Pandas with 
`createDataFrame(pandas_df)`.
    +To use Arrow when executing these calls, it must first be enabled by 
setting the Spark configuration
    +'spark.sql.execution.arrow.enabled' to 'true'. This is disabled by default.
    +
    +<div class="codetabs">
    +<div data-lang="python"  markdown="1">
    +{% highlight python %}
    +
    +import numpy as np
    +import pandas as pd
    +
    +# Enable Arrow, 'spark' is an existing SparkSession
    +spark.conf.set("spark.sql.execution.arrow.enabled", "true")
    +
    +# Generate sample data
    +pdf = pd.DataFrame(np.random.rand(100, 3))
    +
    +# Create a Spark DataFrame from Pandas data using Arrow
    +df = spark.createDataFrame(pdf)
    +
    +# Convert the Spark DataFrame to a local Pandas DataFrame
    +selpdf = df.select("*").toPandas()
    +
    +{% endhighlight %}
    +</div>
    +</div>
    +
    +Using the above optimizations with Arrow will produce the same results as 
when Arrow is not
    +enabled. Not all Spark data types are currently supported, and an error 
will be raised if a column
    +has an unsupported type; see [Supported Types](#supported-types).
    +
    +## Pandas UDFs (a.k.a Vectorized UDFs)
    +
    +With Arrow, we introduce a new type of UDF: the pandas UDF. A pandas UDF is 
defined with the new function
    +`pyspark.sql.functions.pandas_udf` and allows users to apply functions that 
operate on `pandas.Series`
    +and `pandas.DataFrame` with Spark. Currently, there are two types of 
pandas UDF: Scalar and Group Map.
    +
    +### Scalar
    +
    +Scalar pandas UDFs are used for vectorizing scalar operations. They can 
be used with functions such as `select`
    +and `withColumn`. To define a scalar pandas UDF, use `pandas_udf` to 
annotate a Python function. The Python
    +function should take `pandas.Series` as input and return a `pandas.Series` 
of the same size. Internally, Spark will
    +split a column into multiple `pandas.Series`, invoke the Python 
function with each `pandas.Series`, and
    +concatenate the results into a new column.
    +
    +The following example shows how to create a scalar pandas UDF that 
computes the product of two columns.
    +
    +<div class="codetabs">
    +<div data-lang="python"  markdown="1">
    +{% highlight python %}
    +
    +import pandas as pd
    +from pyspark.sql.functions import pandas_udf, PandasUDFType
    +
    +df = spark.createDataFrame(
    +    [(1,), (2,), (3,)],
    +    ['v'])
    +
    +# Declare the function and create the UDF
    +@pandas_udf('long', PandasUDFType.SCALAR)
    +def multiply_udf(a, b):
    +    # a and b are both pandas.Series
    +    return a * b
    +
    +df.select(multiply_udf(df.v, df.v)).show()
    +# +------------------+
    +# |multiply_udf(v, v)|
    +# +------------------+
    +# |                 1|
    +# |                 4|
    +# |                 9|
    +# +------------------+
    +
    +{% endhighlight %}
    +</div>
    +</div>
    +
    +Note that there are two important requirements when using scalar pandas 
UDFs:
    +* The input and output series must have the same size.
    +* How a column is split into multiple `pandas.Series` is internal to 
Spark, and therefore the result
    +  of the user-defined function must be independent of the splitting.
    +
    +### Group Map
    +Group map pandas UDFs are used with `groupBy().apply()`, which implements 
the "split-apply-combine" pattern.
    +Split-apply-combine consists of three steps:
    +* Split the data into groups by using `DataFrame.groupBy`.
    +* Apply a function on each group. The input and output of the function are 
both `pandas.DataFrame`. The
    +  input data contains all the rows and columns for each group.
    +* Combine the results into a new `DataFrame`.
    +
    +To use `groupBy().apply()`, the user needs to define the following:
    +* A Python function that defines the computation for each group.
    +* A `StructType` object or a string that defines the schema of the output 
`DataFrame`.
    +
    +Here we show two examples of using group map pandas UDFs.
    +
    +The first example shows a simple use case: subtracting the mean from each 
value in the group.
    +
    +<div class="codetabs">
    +<div data-lang="python"  markdown="1">
    +{% highlight python %}
    +
    +from pyspark.sql.functions import pandas_udf, PandasUDFType
    +
    +df = spark.createDataFrame(
    +    [(1, 1.0), (1, 2.0), (2, 3.0), (2, 5.0), (2, 10.0)],
    +    ("id", "v"))
    +
    +@pandas_udf("id long, v double", PandasUDFType.GROUP_MAP)
    +def subtract_mean(pdf):
    +    # pdf is a pandas.DataFrame
    +    v = pdf.v
    +    return pdf.assign(v=v - v.mean())
    +
    +df.groupby("id").apply(subtract_mean).show()
    +# +---+----+
    +# | id|   v|
    +# +---+----+
    +# |  1|-0.5|
    +# |  1| 0.5|
    +# |  2|-3.0|
    +# |  2|-1.0|
    +# |  2| 4.0|
    +# +---+----+
    +
    +{% endhighlight %}
    +</div>
    +</div>
    +
    +The second example is more complicated. It shows how to run an 
OLS linear regression
    --- End diff --
    
    Would this second example be better as a separate file in "examples" 
instead of being included in the user guide?
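
The second requirement quoted for scalar pandas UDFs (the result must not depend on how Spark splits a column) may be easier to see with a small sketch. The code below is not Spark code: plain Python lists stand in for `pandas.Series`, and `apply_in_batches`, `square`, and `subtract_batch_mean` are hypothetical names chosen for illustration. It only mimics the split, apply, and concatenate steps the guide describes.

```python
from typing import Callable, List


def apply_in_batches(column: List[float],
                     func: Callable[[List[float]], List[float]],
                     batch_size: int) -> List[float]:
    """Split `column` into batches, apply `func` to each, and concatenate."""
    if batch_size < 1:
        raise ValueError("batch_size must be positive")
    result: List[float] = []
    for start in range(0, len(column), batch_size):
        batch = column[start:start + batch_size]
        out = func(batch)
        # Mirrors the first requirement: output size must equal input size.
        if len(out) != len(batch):
            raise ValueError("function must return one value per input value")
        result.extend(out)
    return result


def square(batch: List[float]) -> List[float]:
    # Element-wise: each output depends only on its own input value.
    return [x * x for x in batch]


def subtract_batch_mean(batch: List[float]) -> List[float]:
    # Depends on the whole batch, so the result changes with the split.
    mean = sum(batch) / len(batch)
    return [x - mean for x in batch]


column = [1.0, 2.0, 3.0, 4.0]

# Same answer for any batch size: safe as a scalar UDF.
square_small = apply_in_batches(column, square, batch_size=2)
square_whole = apply_in_batches(column, square, batch_size=4)

# Different answers for different batch sizes: not safe as a scalar UDF.
centered_small = apply_in_batches(column, subtract_batch_mean, batch_size=2)
centered_whole = apply_in_batches(column, subtract_batch_mean, batch_size=4)
```

A function like `subtract_batch_mean` needs to see a whole group at once, which is what the group map variant provides.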

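The three split-apply-combine steps listed for group map UDFs can likewise be sketched without Spark. This is a minimal illustration under stated assumptions: tuples stand in for rows, `split_by_key` and `group_apply` are hypothetical helpers, and groups are combined in sorted key order only to make the output deterministic (Spark makes no such ordering promise). The input rows are the ones used in the first group map example.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Tuple

Row = Tuple[int, float]

rows: List[Row] = [(1, 1.0), (1, 2.0), (2, 3.0), (2, 5.0), (2, 10.0)]


def split_by_key(rows: List[Row]) -> Dict[int, List[float]]:
    # Step 1: split the data into groups by key.
    groups: Dict[int, List[float]] = defaultdict(list)
    for key, value in rows:
        groups[key].append(value)
    return groups


def subtract_mean(values: List[float]) -> List[float]:
    # Step 2: the per-group function sees all values of one group.
    mean = sum(values) / len(values)
    return [v - mean for v in values]


def group_apply(rows: List[Row],
                func: Callable[[List[float]], List[float]]) -> List[Row]:
    # Step 3: combine the per-group results into one list of rows.
    combined: List[Row] = []
    for key, values in sorted(split_by_key(rows).items()):
        combined.extend((key, v) for v in func(values))
    return combined


result = group_apply(rows, subtract_mean)
```

Here `result` holds the same values as the `show()` output quoted in the diff: -0.5 and 0.5 for id 1, and -3.0, -1.0, and 4.0 for id 2.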


