Github user BryanCutler commented on a diff in the pull request:
https://github.com/apache/spark/pull/19575#discussion_r163960394
--- Diff: docs/sql-programming-guide.md ---
@@ -1640,6 +1640,250 @@ Configuration of Hive is done by placing your
`hive-site.xml`, `core-site.xml` a
You may run `./bin/spark-sql --help` for a complete list of all available
options.
+# PySpark Usage Guide for Pandas with Arrow
+
+## Arrow in Spark
+
+Apache Arrow is an in-memory columnar data format that is used in Spark to
efficiently transfer
+data between JVM and Python processes. This currently is most beneficial
to Python users that
+work with Pandas/NumPy data. Its usage is not automatic and might require
some minor
+changes to configuration or code to take full advantage and ensure
compatibility. This guide will
+give a high-level description of how to use Arrow in Spark and highlight
any differences when
+working with Arrow-enabled data.
+
+### Ensure PyArrow Installed
+
+If you install PySpark using pip, then PyArrow can be brought in as an
extra dependency of the
+SQL module with the command `pip install pyspark[sql]`. Otherwise, you
must ensure that PyArrow
+is installed and available on all cluster nodes. The current supported
version is 0.8.0.
+You can install using pip or conda from the conda-forge channel. See
PyArrow
+[installation](https://arrow.apache.org/docs/python/install.html) for
details.
+
+## Enabling for Conversion to/from Pandas
+
+Arrow is available as an optimization when converting a Spark DataFrame to
Pandas using the call
+`toPandas()` and when creating a Spark DataFrame from Pandas with
`createDataFrame(pandas_df)`.
+To use Arrow when executing these calls, it first must be enabled by
setting the Spark configuration
+'spark.sql.execution.arrow.enabled' to 'true', this is disabled by default.
+
+<div class="codetabs">
+<div data-lang="python" markdown="1">
+{% highlight python %}
+
+import numpy as np
+import pandas as pd
+
+# Enable Arrow, 'spark' is an existing SparkSession
+spark.conf.set("spark.sql.execution.arrow.enabled", "true")
+
+# Generate sample data
+pdf = pd.DataFrame(np.random.rand(100, 3))
+
+# Create a Spark DataFrame from Pandas data using Arrow
+df = spark.createDataFrame(pdf)
+
+# Convert the Spark DataFrame to a local Pandas DataFrame
+selpdf = df.select("*").toPandas()
+
+{% endhighlight %}
+</div>
+</div>
+
+Using the above optimizations with Arrow will produce the same results as
when Arrow is not
+enabled. Not all Spark data types are currently supported and an error
will be raised if a column
+has an unsupported type, see [Supported Types](#supported-types).
+
+## Pandas UDFs (a.k.a Vectorized UDFs)
+
+With Arrow, we introduce a new type of UDF - pandas UDF. Pandas UDF is
defined with a new function
+`pyspark.sql.functions.pandas_udf` and allows user to use functions that
operate on `pandas.Series`
+and `pandas.DataFrame` with Spark. Currently, there are two types of
pandas UDF: Scalar and Group Map.
+
+### Scalar
+
+Scalar pandas UDFs are used for vectorizing scalar operations. They can
used with functions such as `select`
+and `withColumn`. To define a scalar pandas UDF, use `pandas_udf` to
annotate a Python function. The Python
+should takes `pandas.Series` and returns a `pandas.Series` of the same
size. Internally, Spark will
+split a column into multiple `pandas.Series` and invoke the Python
function with each `pandas.Series`, and
+concat the results together to be a new column.
+
+The following example shows how to create a scalar pandas UDF that
computes the product of 2 columns.
+
+<div class="codetabs">
+<div data-lang="python" markdown="1">
+{% highlight python %}
+
+import pandas as pd
+from pyspark.sql.functions import pandas_udf, PandasUDFTypr
+
+df = spark.createDataFrame(
+ [(1,), (2,), (3,)],
+ ['v'])
+
+# Declare the function and create the UDF
+@pandas_udf('long', PandasUDFType.SCALAR)
+def multiply_udf(a, b):
+ # a and b are both pandas.Series
+ return a * b
+
+df.select(multiply_udf(df.v, df.v)).show()
+# +------------------+
+# |multiply_udf(v, v)|
+# +------------------+
+# | 1|
+# | 4|
+# | 9|
+# +------------------+
+
+{% endhighlight %}
+</div>
+</div>
+
+Note that there are two important requirement when using scalar pandas
UDFs:
+* The input and output series must have the same size.
+* How a column is splitted into multiple `pandas.Series` is internal to
Spark, and therefore the result
+ of user-defined function must be independent of the splitting.
+
+### Group Map
+Group map pandas UDFs are used with `groupBy().apply()` which implements
the "split-apply-combine" pattern.
+Split-apply-combine consists of three steps:
+* Split the data into groups by using `DataFrame.groupBy`.
+* Apply a function on each group. The input and output of the function are
both `pandas.DataFrame`. The
+ input data contains all the rows and columns for each group.
+* Combine the results into a new `DataFrame`.
+
+To use groupby apply, user needs to define the following:
+* A Python function that defines the computation for each group.
+* A `StructType` object or a string that defines the schema of the output
`DataFrame`.
+
+Here we show two examples of using group map pandas UDFs.
+
+The first example shows a simple use case: subtracting the mean from each
value in the group.
+
+<div class="codetabs">
+<div data-lang="python" markdown="1">
+{% highlight python %}
+
+from pyspark.sql.functions import pandas_udf, PandasUDFType
+
+df = spark.createDataFrame(
+ [(1, 1.0), (1, 2.0), (2, 3.0), (2, 5.0), (2, 10.0)],
+ ("id", "v"))
+
+@pandas_udf("id long, v double", PandasUDFType.GROUP_MAP)
+def substract_mean(pdf):
+ # pdf is a pandas.DataFrame
+ v = pdf.v
+ return pdf.assign(v=v - v.mean())
+
+df.groupby("id").apply(substract_mean).show()
+# +---+----+
+# | id| v|
+# +---+----+
+# | 1|-0.5|
+# | 1| 0.5|
+# | 2|-3.0|
+# | 2|-1.0|
+# | 2| 4.0|
+# +---+----+
+
+{% endhighlight %}
+</div>
+</div>
+
+The second example is a more complicated example. It shows how to run a
OLS linear regression
--- End diff --
would this second example be better as a separate file in "examples"
instead of including in the user guide?
---
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]