Re: [PR] [SPARK-47864][PYTHON][DOCS] Enhance "Installation" page to cover all installable options [spark]

via GitHub Wed, 17 Apr 2024 17:18:27 -0700


itholic commented on code in PR #46096:
URL: https://github.com/apache/spark/pull/46096#discussion_r1569710412



##########
python/docs/source/getting_started/install.rst:
##########
@@ -165,16 +168,117 @@ To install PySpark from source, refer to |building_spark|_.
 
 Dependencies
 ------------
-========================== ========================= ======================================================================================
-Package                    Supported version         Note
-========================== ========================= ======================================================================================
-`py4j`                     >=0.10.9.7                Required
-`pandas`                   >=1.4.4                   Required for pandas API on Spark and Spark Connect; Optional for Spark SQL
-`pyarrow`                  >=10.0.0                  Required for pandas API on Spark and Spark Connect; Optional for Spark SQL
-`numpy`                    >=1.21                    Required for pandas API on Spark and MLLib DataFrame-based API; Optional for Spark SQL
-`grpcio`                   >=1.62.0                  Required for Spark Connect
-`grpcio-status`            >=1.62.0                  Required for Spark Connect
-`googleapis-common-protos` >=1.56.4                  Required for Spark Connect
-========================== ========================= ======================================================================================
+
+Required dependencies
+~~~~~~~~~~~~~~~~~~~~~
+
+PySpark requires the following dependencies.
+
+========================== ========================= ============================================
+Package                    Supported version         Note
+========================== ========================= ============================================
+`py4j`                     >=0.10.9.7                Essential for Python to interface with the
+                                                     Java objects in Spark; ensures seamless
+                                                     interaction between Python and JVM.
+========================== ========================= ============================================
+
+Additional libraries that enhance functionality but are not included in the installation packages:
+
+- **memory-profiler**: Useful for diagnosing and analyzing memory usage in PySpark applications.
 
 Note that PySpark requires Java 17 or later with ``JAVA_HOME`` properly set and refer to |downloading|_.
+
+
+.. _optional-dependencies:
+
+Optional dependencies
+~~~~~~~~~~~~~~~~~~~~~
+
+PySpark has several optional dependencies that enhance its functionality for specific modules.
+These dependencies are only required for certain features and are not necessary for the basic functionality of PySpark.
+If these optional dependencies are not installed, PySpark will function correctly for basic operations but will raise an ``ImportError``
+when you try to use features that require these dependencies.
+
+Spark SQL
+^^^^^^^^^
+
+Installable with ``pip install "pyspark[sql]"``.
+
+========= ================= ==================================================================
+Package   Supported version Note
+========= ================= ==================================================================
+`pandas`  >=1.4.4           Enables seamless DataFrame operations between Spark and Pandas.
+`pyarrow` >=10.0.0          Optimizes data conversion and transfer between PySpark and Pandas.
+`numpy`   >=1.21            Essential for numerical data manipulation within PySpark.
+========= ================= ==================================================================
+
+
+Pandas API on Spark
+^^^^^^^^^^^^^^^^^^^
+
+Installable with ``pip install "pyspark[pandas_on_spark]"``.
+
+========= ================= =====================================================================
+Package   Supported version Note
+========= ================= =====================================================================
+`pandas`  >=1.4.4           Required for utilizing the Pandas API features in Spark.
+`pyarrow` >=10.0.0          Ensures efficient data handling and performance in Pandas operations.
+`numpy`   >=1.21            Facilitates complex numerical operations within Spark.
+========= ================= =====================================================================
+
+Additional libraries that enhance functionality but are not included in the installation packages:
+
+- **mlflow**: Enhances machine learning lifecycle management, including experiment tracking and model deployment.
+- **plotly, matplotlib**: Provide advanced plotting capabilities for visualization.
+
+
+ML
+^^
+
+Installable with ``pip install "pyspark[ml]"``.
+
+======= ================= =======================================================================
+Package Supported version Note
+======= ================= =======================================================================
+`numpy` >=1.21            Supports advanced data manipulation and algorithm implementation in ML.
+======= ================= =======================================================================
+
+Additional libraries that enhance functionality but are not included in the installation packages:
+
+- **scipy**: Essential for scientific computing and statistical functions in ML.
+- **scikit-learn**: Required for implementing machine learning algorithms.
+
+MLlib
+^^^^^
+
+Installable with ``pip install "pyspark[mllib]"``.
+
+======= ================= ====================================================================================================
+Package Supported version Note
+======= ================= ====================================================================================================
+`numpy` >=1.21            Essential for mathematical operations within MLlib, improves performance and accuracy of algorithms.
+======= ================= ====================================================================================================
+
+Additional libraries that enhance functionality but are not included in the installation packages:
+
+- **torch**: Utilized for machine learning model training on PySpark.
+- **torchvision**: Supports image and video processing within PySpark models.
+- **torcheval**: Facilitates model evaluation metrics in PySpark.
+- **deepspeed; sys_platform != 'darwin'**: Provides high-performance model training optimizations. Installable on non-Darwin systems.
+
+
+Spark Connect
+^^^^^^^^^^^^^
+
+Installable with ``pip install "pyspark[connect]"``.

Review Comment:
   Moved the section to the top.
   
   IIRC `pyspark-connect` is not yet officially supported by pip install, so I don't mention it for now. Maybe we can update the documentation when it's officially released. @HyukjinKwon could you double-check whether I understand correctly?
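
For readers of the "Optional dependencies" text in the diff above, which says an `ImportError` is raised when a feature's packages are missing, here is a minimal sketch of how one might check up front which of the Spark SQL packages are importable. The helper `missing_modules` and the list `SQL_EXTRA_MODULES` are hypothetical names used only for illustration and are not part of PySpark; the module names are taken from the Spark SQL table in the diff, and the sketch does not check versions.

```python
from importlib.util import find_spec

# Import names of the packages listed in the "Spark SQL" table of the diff.
# This list is an illustrative assumption, not something PySpark exposes.
SQL_EXTRA_MODULES = ["pandas", "pyarrow", "numpy"]


def missing_modules(module_names):
    """Return the top-level module names that cannot be found for import."""
    return [name for name in module_names if find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules(SQL_EXTRA_MODULES)
    if missing:
        print("Missing optional dependencies:", ", ".join(missing))
        print('They can be installed with: pip install "pyspark[sql]"')
    else:
        print("All Spark SQL optional dependencies are importable.")
```

The check only looks up top-level modules without importing them, so it stays cheap; it says nothing about whether the installed versions meet the minimums in the table.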



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

