[GitHub] [spark] viirya commented on a change in pull request #29548: [SPARK-32183][DOCS][PYTHON] User Guide - PySpark Usage Guide for Pandas with Apache Arrow

GitBox Thu, 27 Aug 2020 16:40:30 -0700


viirya commented on a change in pull request #29548:
URL: https://github.com/apache/spark/pull/29548#discussion_r478752999




##########
File path: python/docs/source/user_guide/arrow_pandas.rst
##########
@@ -0,0 +1,411 @@
+..  Licensed to the Apache Software Foundation (ASF) under one
+    or more contributor license agreements.  See the NOTICE file
+    distributed with this work for additional information
+    regarding copyright ownership.  The ASF licenses this file
+    to you under the Apache License, Version 2.0 (the
+    "License"); you may not use this file except in compliance
+    with the License.  You may obtain a copy of the License at
+
+..    http://www.apache.org/licenses/LICENSE-2.0
+
+..  Unless required by applicable law or agreed to in writing,
+    software distributed under the License is distributed on an
+    "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+    KIND, either express or implied.  See the License for the
+    specific language governing permissions and limitations
+    under the License.
+
+=======================
+Apache Arrow in PySpark
+=======================
+
+.. currentmodule:: pyspark.sql
+
+Apache Arrow is an in-memory columnar data format that is used in Spark to efficiently transfer
+data between JVM and Python processes. This currently is most beneficial to Python users that
+work with Pandas/NumPy data. Its usage is not automatic and might require some minor
+changes to configuration or code to take full advantage and ensure compatibility. This guide will
+give a high-level description of how to use Arrow in Spark and highlight any differences when
+working with Arrow-enabled data.
+
+Ensure PyArrow Installed
+------------------------
+
+To use Apache Arrow in PySpark, `the recommended version of PyArrow <arrow_pandas.rst#recommended-pandas-and-pyarrow-versions>`_
+should be installed.
+If you install PySpark using pip, then PyArrow can be brought in as an extra dependency of the
+SQL module with the command ``pip install pyspark[sql]``. Otherwise, you must ensure that PyArrow
+is installed and available on all cluster nodes.
+You can install using pip or conda from the conda-forge channel. See PyArrow
+`installation <https://arrow.apache.org/docs/python/install.html>`_ for details.
+
+Enabling for Conversion to/from Pandas
+--------------------------------------
+
+Arrow is available as an optimization when converting a Spark DataFrame to a Pandas DataFrame
+using the call :meth:`DataFrame.toPandas` and when creating a Spark DataFrame from a Pandas DataFrame with
+:meth:`SparkSession.createDataFrame`. To use Arrow when executing these calls, users need to first set
+the Spark configuration ``spark.sql.execution.arrow.pyspark.enabled`` to ``true``. This is disabled by default.
+
+In addition, optimizations enabled by ``spark.sql.execution.arrow.pyspark.enabled`` could fallback automatically
+to non-Arrow optimization implementation if an error occurs before the actual computation within Spark.
+This can be controlled by ``spark.sql.execution.arrow.pyspark.fallback.enabled``.
+
+.. literalinclude:: ../../../../examples/src/main/python/sql/arrow.py
+    :language: python
+    :lines: 35-48
+    :dedent: 4
+
+Using the above optimizations with Arrow will produce the same results as when Arrow is not
+enabled.
+
+Note that even with Arrow, :meth:`DataFrame.toPandas` results in the collection of all records in the
+DataFrame to the driver program and should be done on a small subset of the data. Not all Spark
+data types are currently supported and an error can be raised if a column has an unsupported type.
+If an error occurs during :meth:`SparkSession.createDataFrame`, Spark will fall back to create the
+DataFrame without Arrow.
+
+Pandas UDFs (a.k.a. Vectorized UDFs)
+------------------------------------
+
+.. currentmodule:: pyspark.sql.functions
+
+Pandas UDFs are user defined functions that are executed by Spark using
+Arrow to transfer data and Pandas to work with the data, which allows vectorized operations. A Pandas
+UDF is defined using the :meth:`pandas_udf` as a decorator or to wrap the function, and no additional
+configuration is required. A Pandas UDF behaves as a regular PySpark function API in general.
+
+Before Spark 3.0, Pandas UDFs used to be defined with ``pyspark.sql.functions.PandasUDFType``. From Spark 3.0
+with Python 3.6+, you can also use `Python type hints <https://www.python.org/dev/peps/pep-0484>`_.
+Using Python type hints are preferred and using ``pyspark.sql.functions.PandasUDFType`` will be deprecated in

Review comment:
       nit: Using Python type hints is preferred..
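
For readers following the thread, here is a minimal sketch of the type-hint style that the quoted paragraph prefers. It is an illustration only, not code from the PR: the names `plus_one` and `apply_plus_one` are hypothetical, and it assumes Spark 3.0+ with pandas and PyArrow installed. It uses the "wrap the function" form of `pandas_udf` mentioned in the quoted text rather than the decorator form.

```python
import pandas as pd


def plus_one(s: pd.Series) -> pd.Series:
    """Series-to-Series function; Spark 3.0+ reads the UDF kind from the type hints."""
    return s + 1


def apply_plus_one(spark):
    """Wrap plus_one as a Pandas UDF and apply it (hypothetical helper).

    `spark` is assumed to be an active SparkSession.
    """
    # Imported lazily so the plain pandas function above is usable without Spark.
    from pyspark.sql.functions import pandas_udf

    # Wrapping form of pandas_udf; "long" is the declared return type.
    plus_one_udf = pandas_udf(plus_one, returnType="long")
    df = spark.range(3)  # one column named "id"
    return df.select(plus_one_udf("id"))
```

Because the function body is plain pandas, it can be checked locally on a `pd.Series` before Spark is involved; only `apply_plus_one` needs a running session.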




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]


