This is an automated email from the ASF dual-hosted git repository.
gurwls223 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git
The following commit(s) were added to refs/heads/master by this push:
new c1708c94fb1 [SPARK-41583][CONNECT][PROTOBUF] Add Spark Connect and protobuf into setup.py with specifying dependencies
c1708c94fb1 is described below
commit c1708c94fb136cc9c01c6f8461fdc8ade7175894
Author: Hyukjin Kwon <[email protected]>
AuthorDate: Tue Dec 20 09:04:13 2022 +0900
[SPARK-41583][CONNECT][PROTOBUF] Add Spark Connect and protobuf into setup.py with specifying dependencies
### What changes were proposed in this pull request?
This PR proposes to:
- Add `pyspark.sql.connect` and `pyspark.sql.protobuf` to the PySpark
package in PyPI.
- Fix the documentation to specify the dependencies for Python Spark
Connect client.
### Why are the changes needed?
To guide users to use Spark Connect and Protobuf, and to package these features so they are released properly.
### Does this PR introduce _any_ user-facing change?
Yes, this exposes both `pyspark.sql.connect` and `pyspark.sql.protobuf` to end users in the PyPI package. In addition, it fixes the user-facing documentation about the dependencies for Spark Connect.
### How was this patch tested?
CI in this PR should test it out.
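For illustration, once a release containing this change is published, end users would install the new extra like this (a sketch; the `connect` extra name comes from the setup.py change in this commit, and actual dependency resolution depends on the user's environment):

```shell
# Install PySpark together with the Spark Connect client dependencies
pip install "pyspark[connect]"
```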
Closes #39123 from HyukjinKwon/SPARK-41583.
Lead-authored-by: Hyukjin Kwon <[email protected]>
Co-authored-by: Hyukjin Kwon <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
---
python/docs/source/getting_started/install.rst | 15 ++++++++-------
python/setup.py | 10 ++++++++++
2 files changed, 18 insertions(+), 7 deletions(-)
diff --git a/python/docs/source/getting_started/install.rst b/python/docs/source/getting_started/install.rst
index dd48741099c..d3b24be3d49 100644
--- a/python/docs/source/getting_started/install.rst
+++ b/python/docs/source/getting_started/install.rst
@@ -50,6 +50,8 @@ If you want to install extra dependencies for a specific component, you can install it as below:
pip install pyspark[sql]
# pandas API on Spark
pip install pyspark[pandas_on_spark] plotly  # to plot your data, you can install plotly together.
+ # Spark Connect
+ pip install pyspark[connect]
For PySpark with/without a specific Hadoop version, you can install it by
using ``PYSPARK_HADOOP_VERSION`` environment variables as below:
@@ -151,16 +153,15 @@ To install PySpark from source, refer to |building_spark|_.
Dependencies
------------
-============= ========================= ======================================
+============= ========================= ======================================================================================
Package       Minimum supported version Note
-============= ========================= ======================================
-`pandas`      1.0.5                     Optional for Spark SQL
-`pyarrow`     1.0.0                     Optional for Spark SQL
+============= ========================= ======================================================================================
`py4j`        0.10.9.7                  Required
-`pandas`      1.0.5                     Required for pandas API on Spark
-`pyarrow`     1.0.0                     Required for pandas API on Spark
+`pandas`      1.0.5                     Required for pandas API on Spark and Spark Connect; Optional for Spark SQL
+`pyarrow`     1.0.0                     Required for pandas API on Spark and Spark Connect; Optional for Spark SQL
`numpy`       1.15                      Required for pandas API on Spark and MLLib DataFrame-based API; Optional for Spark SQL
-============= ========================= ======================================
+`grpc`        1.48.1                    Required for Spark Connect
+============= ========================= ======================================================================================
Note that PySpark requires Java 8 or later with ``JAVA_HOME`` properly set.
If using JDK 11, set ``-Dio.netty.tryReflectionSetAccessible=true`` for Arrow
related features and refer
diff --git a/python/setup.py b/python/setup.py
index 65db3912efe..4ba2740246a 100755
--- a/python/setup.py
+++ b/python/setup.py
@@ -113,6 +113,7 @@ if (in_spark):
# Also don't forget to update python/docs/source/getting_started/install.rst.
_minimum_pandas_version = "1.0.5"
_minimum_pyarrow_version = "1.0.0"
+_minimum_grpc_version = "1.48.1"
class InstallCommand(install):
@@ -215,7 +216,10 @@ try:
'pyspark.ml.param',
'pyspark.sql',
'pyspark.sql.avro',
+ 'pyspark.sql.connect',
+ 'pyspark.sql.connect.proto',
'pyspark.sql.pandas',
+ 'pyspark.sql.protobuf',
'pyspark.sql.streaming',
'pyspark.streaming',
'pyspark.bin',
@@ -273,6 +277,12 @@ try:
'pyarrow>=%s' % _minimum_pyarrow_version,
'numpy>=1.15',
],
+ 'connect': [
+ 'pandas>=%s' % _minimum_pandas_version,
+ 'pyarrow>=%s' % _minimum_pyarrow_version,
+ 'grpc>=%s' % _minimum_grpc_version,
+ 'numpy>=1.15',
+ ],
},
python_requires='>=3.7',
classifiers=[
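The extras_require addition above can be sketched standalone as follows (the constants and dependency list mirror the diff; this is not the full setup.py, and the surrounding setuptools call is omitted):

```python
# Minimum versions kept in one place so setup.py and install.rst stay in sync
# (values copied from the diff above).
_minimum_pandas_version = "1.0.5"
_minimum_pyarrow_version = "1.0.0"
_minimum_grpc_version = "1.48.1"

# The new 'connect' extra: the pandas-on-Spark dependencies plus grpc,
# so `pip install pyspark[connect]` pulls everything the client needs.
extras_require = {
    "connect": [
        "pandas>=%s" % _minimum_pandas_version,
        "pyarrow>=%s" % _minimum_pyarrow_version,
        "grpc>=%s" % _minimum_grpc_version,
        "numpy>=1.15",
    ],
}

print(extras_require["connect"])
```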
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]