Re: [PR] [SPARK-47864][PYTHON][DOCS] Enhance "Installation" page to cover all installable options [spark]

via GitHub Wed, 17 Apr 2024 00:18:03 -0700


HyukjinKwon commented on code in PR #46096:
URL: https://github.com/apache/spark/pull/46096#discussion_r1568344903



##########
python/docs/source/getting_started/install.rst:
##########
@@ -165,16 +168,92 @@ To install PySpark from source, refer to |building_spark|_.
 
 Dependencies
 ------------
+
+Required dependencies
+~~~~~~~~~~~~~~~~~~~~~
+
+PySpark requires the following dependencies.
+
+========================== ========================= ============================================
+Package                    Supported version         Note
+========================== ========================= ============================================
+`py4j`                     >=0.10.9.7                Essential for Python to interface with the
+                                                     Java objects in Spark; ensures seamless
+                                                     interaction between Python and JVM.
+========================== ========================= ============================================
+
+Note that PySpark requires Java 17 or later with ``JAVA_HOME`` properly set and refer to |downloading|_.
+
+
+.. _optional-dependencies:
+
+Optional dependencies
+~~~~~~~~~~~~~~~~~~~~~
+
+PySpark has several optional dependencies that enhance its functionality for specific modules. These dependencies are only required for certain features and are not necessary for the basic functionality of PySpark. If these optional dependencies are not installed, PySpark will function correctly for basic operations but will raise an ``ImportError`` when you try to use features that require these dependencies.
+
+Spark SQL
+^^^^^^^^^
+
+Installable with ``pip install "pyspark[sql]"``.
+
+========================== ========================= ======================================================
+Package                    Supported version         Note
+========================== ========================= ======================================================
+`pandas`                   >=1.4.4                   Enables seamless DataFrame operations between Spark and Pandas.
+`pyarrow`                  >=10.0.0                  Optimizes data conversion and transfer between PySpark and Pandas.
+`numpy`                    >=1.21                    Essential for numerical data manipulation within PySpark.
+========================== ========================= ======================================================
+
+Pandas API on Spark
+^^^^^^^^^^^^^^^^^^^
+
+Installable with ``pip install "pyspark[pandas_on_spark]"``.
+
+========================== ========================= ======================================================
+Package                    Supported version         Note
+========================== ========================= ======================================================
+`pandas`                   >=1.4.4                   Required for utilizing the Pandas API features in Spark.
+`pyarrow`                  >=10.0.0                  Ensures efficient data handling and performance in Pandas operations.
+`numpy`                    >=1.21                    Facilitates complex numerical operations within Spark.
+========================== ========================= ======================================================
+
+Note: Run ``pip install "pyspark[pandas_on_spark] plotly"`` if you want to use visualization features.
+
+ML
+^^
+
+Installable with ``pip install "pyspark[ml]"``.
+
 ========================== ========================= ======================================================================================
-Package                    Supported version Note
+Package                    Supported version         Note
 ========================== ========================= ======================================================================================
-`py4j`                     >=0.10.9.7                Required
-`pandas`                   >=1.4.4                   Required for pandas API on Spark and Spark Connect; Optional for Spark SQL
-`pyarrow`                  >=10.0.0                  Required for pandas API on Spark and Spark Connect; Optional for Spark SQL
-`numpy`                    >=1.21                    Required for pandas API on Spark and MLLib DataFrame-based API; Optional for Spark SQL
-`grpcio`                   >=1.62.0                  Required for Spark Connect
-`grpcio-status`            >=1.62.0                  Required for Spark Connect
-`googleapis-common-protos` >=1.56.4                  Required for Spark Connect
+`numpy`                    >=1.21                    Supports advanced data manipulation and algorithm implementation in ML.
 ========================== ========================= ======================================================================================
 
-Note that PySpark requires Java 17 or later with ``JAVA_HOME`` properly set and refer to |downloading|_.
+MLlib
+^^^^^
+
+Installable with ``pip install "pyspark[mllib]"``.
+
+========================== ========================= ======================================================================================
+Package                    Supported version         Note
+========================== ========================= ======================================================================================

Review Comment:
   Let's match the number of `=` to the text
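
One way to read this suggestion is that the last column's `=` run should end where the widest row of the table ends. The Python sketch below illustrates that reading only; `fit_last_column` is a hypothetical helper that is not part of the PR or of Spark, and it assumes an RST simple table whose border lines contain only `=` and spaces.

```python
def fit_last_column(table_lines):
    """Resize the last column's `=` run so it ends at the widest row.

    Hypothetical helper for illustration; assumes an RST simple table
    whose border lines consist only of `=` characters and spaces.
    """
    def is_border(line):
        stripped = line.strip()
        return bool(stripped) and set(stripped) <= {"=", " "}

    borders = [line for line in table_lines if is_border(line)]
    rows = [line for line in table_lines if line.strip() and not is_border(line)]
    if not borders or not rows:
        # Nothing to resize: return an unchanged copy.
        return list(table_lines)

    # The last column starts right after the final space in the border line.
    start = borders[0].rstrip().rfind(" ") + 1
    # Width needed so the border reaches the end of the longest row.
    width = max(1, max(len(row.rstrip()) for row in rows) - start)
    new_border = borders[0][:start] + "=" * width
    return [new_border if is_border(line) else line.rstrip() for line in table_lines]


# Example: the border is wider than the text, as in the tables under review.
table = [
    "======= ==========================",
    "Package Note",
    "======= ==========================",
    "`numpy` Used by ML.",
    "======= ==========================",
]
print("\n".join(fit_last_column(table)))
```

The sketch only touches the last column, since that is the one whose width varies between the tables in this diff; the first two columns are left as they are.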



##########
python/docs/source/getting_started/install.rst:
##########
@@ -165,16 +168,92 @@ To install PySpark from source, refer to |building_spark|_.
 
 Dependencies
 ------------
+
+Required dependencies
+~~~~~~~~~~~~~~~~~~~~~
+
+PySpark requires the following dependencies.
+
+========================== ========================= ============================================
+Package                    Supported version         Note
+========================== ========================= ============================================
+`py4j`                     >=0.10.9.7                Essential for Python to interface with the
+                                                     Java objects in Spark; ensures seamless
+                                                     interaction between Python and JVM.
+========================== ========================= ============================================
+
+Note that PySpark requires Java 17 or later with ``JAVA_HOME`` properly set and refer to |downloading|_.
+
+
+.. _optional-dependencies:
+
+Optional dependencies
+~~~~~~~~~~~~~~~~~~~~~
+
+PySpark has several optional dependencies that enhance its functionality for specific modules. These dependencies are only required for certain features and are not necessary for the basic functionality of PySpark. If these optional dependencies are not installed, PySpark will function correctly for basic operations but will raise an ``ImportError`` when you try to use features that require these dependencies.

Review Comment:
   let's add some newlines



