GitHub user BryanCutler commented on a diff in the pull request:
https://github.com/apache/spark/pull/19575#discussion_r164176830
--- Diff: docs/sql-programming-guide.md ---
@@ -1640,6 +1640,129 @@ Configuration of Hive is done by placing your `hive-site.xml`, `core-site.xml` a
 You may run `./bin/spark-sql --help` for a complete list of all available options.
+# PySpark Usage Guide for Pandas with Arrow
+
+## Arrow in Spark
+
+Apache Arrow is an in-memory columnar data format that is used in Spark to efficiently transfer
+data between JVM and Python processes. This currently is most beneficial to Python users that
+work with Pandas/NumPy data. Its usage is not automatic and might require some minor
+changes to configuration or code to take full advantage and ensure compatibility. This guide will
+give a high-level description of how to use Arrow in Spark and highlight any differences when
+working with Arrow-enabled data.
+
+### Ensure PyArrow Installed
+
+If you install PySpark using pip, then PyArrow can be brought in as an extra dependency of the
+SQL module with the command `pip install pyspark[sql]`. Otherwise, you must ensure that PyArrow
+is installed and available on all cluster nodes. The current supported version is 0.8.0.
+You can install using pip or conda from the conda-forge channel. See PyArrow
+[installation](https://arrow.apache.org/docs/python/install.html) for details.
+
+## Enabling for Conversion to/from Pandas
+
+Arrow is available as an optimization when converting a Spark DataFrame to Pandas using the call
+`toPandas()` and when creating a Spark DataFrame from Pandas with `createDataFrame(pandas_df)`.
+To use Arrow when executing these calls, users need to first set the Spark configuration
+'spark.sql.execution.arrow.enabled' to 'true'. This is disabled by default.
+
+<div class="codetabs">
+<div data-lang="python" markdown="1">
+{% include_example dataframe_with_arrow python/sql/arrow.py %}
+</div>
+</div>
+
+Using the above optimizations with Arrow will produce the same results as when Arrow is not
+enabled. Not all Spark data types are currently supported and an error will be raised if a column
+has an unsupported type, see [Supported Types](#supported-types).
+
+## Pandas UDFs (a.k.a. Vectorized UDFs)
+
+With Arrow, we introduce a new type of UDF - pandas UDF. Pandas UDF is defined with a new function
+`pyspark.sql.functions.pandas_udf` and allows users to use functions that operate on `pandas.Series`
+and `pandas.DataFrame` with Spark. Currently, there are two types of pandas UDF: Scalar and Group Map.
--- End diff --
I think this should say that pandas UDFs don't need the configuration from the
previous section ('spark.sql.execution.arrow.enabled') to be enabled.
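
To illustrate the point, here is a minimal sketch, not taken from the PR. It assumes that pandas UDFs exchange data through Arrow regardless of the flag, so that 'spark.sql.execution.arrow.enabled' only affects `toPandas()` and `createDataFrame(pandas_df)`. The names `plus_one` and `run_demo`, the local master, and the app name are hypothetical and chosen only for illustration.

```python
def plus_one(values):
    """Add one to every element; Spark passes each batch in as a pandas.Series."""
    return values + 1


def run_demo():
    # Imported here so the plain function above stays usable without Spark installed.
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col, pandas_udf
    from pyspark.sql.types import LongType

    # Hypothetical local session. Note that 'spark.sql.execution.arrow.enabled'
    # is deliberately left at its default (disabled).
    spark = (
        SparkSession.builder.master("local[1]")
        .appName("pandas-udf-without-arrow-flag")
        .getOrCreate()
    )
    try:
        # Wrap the plain function as a scalar pandas UDF returning longs.
        plus_one_udf = pandas_udf(plus_one, returnType=LongType())
        df = spark.range(3)
        df.select(plus_one_udf(col("id")).alias("id_plus_one")).show()
    finally:
        spark.stop()
```

Calling `run_demo()` on a machine with PySpark and PyArrow installed should run the UDF without any Arrow configuration being set, which is the behavior the docs could mention explicitly.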