HyukjinKwon commented on code in PR #46096:
URL: https://github.com/apache/spark/pull/46096#discussion_r1568487322
##########
python/docs/source/getting_started/install.rst:
##########
@@ -165,16 +168,117 @@ To install PySpark from source, refer to |building_spark|_.
 Dependencies
 ------------
-========================== ========================= ======================================================================================
-Package                    Supported version         Note
-========================== ========================= ======================================================================================
-`py4j`                     >=0.10.9.7                Required
-`pandas`                   >=1.4.4                   Required for pandas API on Spark and Spark Connect; Optional for Spark SQL
-`pyarrow`                  >=10.0.0                  Required for pandas API on Spark and Spark Connect; Optional for Spark SQL
-`numpy`                    >=1.21                    Required for pandas API on Spark and MLLib DataFrame-based API; Optional for Spark SQL
-`grpcio`                   >=1.62.0                  Required for Spark Connect
-`grpcio-status`            >=1.62.0                  Required for Spark Connect
-`googleapis-common-protos` >=1.56.4                  Required for Spark Connect
-========================== ========================= ======================================================================================
+
+Required dependencies
+~~~~~~~~~~~~~~~~~~~~~
+
+PySpark requires the following dependencies.
+
+========================== ========================= ============================================
+Package                    Supported version         Note
+========================== ========================= ============================================
+`py4j`                     >=0.10.9.7                Essential for Python to interface with the
+                                                     Java objects in Spark; ensures seamless
+                                                     interaction between Python and JVM.
+========================== ========================= ============================================
+
+Additional libraries that enhance functionality but are not included in the installation packages:
+
+- **memory-profiler**: Useful for diagnosing and analyzing memory usage in PySpark applications. Note that PySpark requires Java 17 or later with ``JAVA_HOME`` properly set and refer to |downloading|_.
+
+
+.. _optional-dependencies:
+
+Optional dependencies
+~~~~~~~~~~~~~~~~~~~~~
+
+PySpark has several optional dependencies that enhance its functionality for specific modules.
+These dependencies are only required for certain features and are not necessary for the basic functionality of PySpark.
+If these optional dependencies are not installed, PySpark will function correctly for basic operations but will raise an ``ImportError``
+when you try to use features that require these dependencies.
+
+Spark SQL
+^^^^^^^^^
+
+Installable with ``pip install "pyspark[sql]"``.
+
+========= ================= ==================================================================
+Package   Supported version Note
+========= ================= ==================================================================
+`pandas`  >=1.4.4           Enables seamless DataFrame operations between Spark and Pandas.
+`pyarrow` >=10.0.0          Optimizes data conversion and transfer between PySpark and Pandas.
+`numpy`   >=1.21            Essential for numerical data manipulation within PySpark.
+========= ================= ==================================================================
+
+
+Pandas API on Spark
+^^^^^^^^^^^^^^^^^^^
+
+Installable with ``pip install "pyspark[pandas_on_spark]"``.
+
+========= ================= =====================================================================
+Package   Supported version Note
+========= ================= =====================================================================
+`pandas`  >=1.4.4           Required for utilizing the Pandas API features in Spark.
+`pyarrow` >=10.0.0          Ensures efficient data handling and performance in Pandas operations.
+`numpy`   >=1.21            Facilitates complex numerical operations within Spark.
+========= ================= =====================================================================
+
+Additional libraries that enhance functionality but are not included in the installation packages:
+
+- **mlflow**: Enhances machine learning lifecycle management, including experiment tracking and model deployment.
+- **plotly, matplotlib**: Provide advanced plotting capabilities for visualization.

Review Comment:
   Let's list `plotly` and `matplotlib` separately, and mention which API we use, and say `plotly` is preferred.

##########
python/docs/source/getting_started/install.rst:
##########
@@ -165,16 +168,117 @@ To install PySpark from source, refer to |building_spark|_.
 Dependencies
 ------------
-========================== ========================= ======================================================================================
-Package                    Supported version         Note
-========================== ========================= ======================================================================================
-`py4j`                     >=0.10.9.7                Required
-`pandas`                   >=1.4.4                   Required for pandas API on Spark and Spark Connect; Optional for Spark SQL
-`pyarrow`                  >=10.0.0                  Required for pandas API on Spark and Spark Connect; Optional for Spark SQL
-`numpy`                    >=1.21                    Required for pandas API on Spark and MLLib DataFrame-based API; Optional for Spark SQL
-`grpcio`                   >=1.62.0                  Required for Spark Connect
-`grpcio-status`            >=1.62.0                  Required for Spark Connect
-`googleapis-common-protos` >=1.56.4                  Required for Spark Connect
-========================== ========================= ======================================================================================
+
+Required dependencies
+~~~~~~~~~~~~~~~~~~~~~
+
+PySpark requires the following dependencies.
+
+========================== ========================= ============================================
+Package                    Supported version         Note
+========================== ========================= ============================================
+`py4j`                     >=0.10.9.7                Essential for Python to interface with the
+                                                     Java objects in Spark; ensures seamless
+                                                     interaction between Python and JVM.
+========================== ========================= ============================================
+
+Additional libraries that enhance functionality but are not included in the installation packages:
+
+- **memory-profiler**: Useful for diagnosing and analyzing memory usage in PySpark applications. Note that PySpark requires Java 17 or later with ``JAVA_HOME`` properly set and refer to |downloading|_.
+
+
+.. _optional-dependencies:
+
+Optional dependencies
+~~~~~~~~~~~~~~~~~~~~~
+
+PySpark has several optional dependencies that enhance its functionality for specific modules.
+These dependencies are only required for certain features and are not necessary for the basic functionality of PySpark.
+If these optional dependencies are not installed, PySpark will function correctly for basic operations but will raise an ``ImportError``
+when you try to use features that require these dependencies.
+
+Spark SQL
+^^^^^^^^^
+
+Installable with ``pip install "pyspark[sql]"``.
+
+========= ================= ==================================================================
+Package   Supported version Note
+========= ================= ==================================================================
+`pandas`  >=1.4.4           Enables seamless DataFrame operations between Spark and Pandas.
+`pyarrow` >=10.0.0          Optimizes data conversion and transfer between PySpark and Pandas.
+`numpy`   >=1.21            Essential for numerical data manipulation within PySpark.
+========= ================= ==================================================================
+
+
+Pandas API on Spark
+^^^^^^^^^^^^^^^^^^^
+
+Installable with ``pip install "pyspark[pandas_on_spark]"``.
+
+========= ================= =====================================================================
+Package   Supported version Note
+========= ================= =====================================================================
+`pandas`  >=1.4.4           Required for utilizing the Pandas API features in Spark.
+`pyarrow` >=10.0.0          Ensures efficient data handling and performance in Pandas operations.
+`numpy`   >=1.21            Facilitates complex numerical operations within Spark.
+========= ================= =====================================================================
+
+Additional libraries that enhance functionality but are not included in the installation packages:
+
+- **mlflow**: Enhances machine learning lifecycle management, including experiment tracking and model deployment.

Review Comment:
   Mention which API it uses

-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
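The proposed docs say a missing optional dependency only surfaces as an ``ImportError`` when a feature that needs it is used. A minimal sketch of that pattern, using only the standard library (``require_optional`` and the message wording are hypothetical, not PySpark's actual helper):

```python
from importlib import util


def require_optional(package: str, feature: str, pip_extra: str) -> None:
    """Raise a helpful ImportError if an optional dependency is missing.

    Mirrors the documented behavior: basic operations work without the
    package, but a feature that needs it fails with a clear message.
    """
    if util.find_spec(package) is None:
        raise ImportError(
            f"{package} is required for {feature}; "
            f'install it with: pip install "pyspark[{pip_extra}]"'
        )


# A deliberately nonexistent package name, to show the failure mode:
try:
    require_optional("no_such_package_for_demo", "this demo feature", "sql")
except ImportError as exc:
    print(exc)
```

In PySpark itself such checks live next to the features that need them, which is why a plain ``pip install pyspark`` stays lightweight while ``pip install "pyspark[sql]"`` pulls in the extras listed above.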

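The "Supported version" columns in the tables above are runtime floors. A rough sketch of such a floor check with only the standard library (``MINIMUM_VERSIONS`` restates the documented minimums; ``meets_minimum`` is a made-up name, and a production check should use ``packaging.version`` to handle pre-release suffixes that ``int()`` cannot parse):

```python
# Minimum versions restated from the dependency tables above.
MINIMUM_VERSIONS = {
    "py4j": "0.10.9.7",
    "pandas": "1.4.4",
    "pyarrow": "10.0.0",
    "numpy": "1.21",
    "grpcio": "1.62.0",
    "grpcio-status": "1.62.0",
    "googleapis-common-protos": "1.56.4",
}


def version_tuple(version: str) -> tuple[int, ...]:
    """Turn '1.4.4' into (1, 4, 4) for component-wise comparison."""
    return tuple(int(part) for part in version.split("."))


def meets_minimum(package: str, installed: str) -> bool:
    """True if the installed version satisfies the documented floor."""
    return version_tuple(installed) >= version_tuple(MINIMUM_VERSIONS[package])


print(meets_minimum("pandas", "2.0.3"))  # → True
print(meets_minimum("numpy", "1.20.0"))  # → False
</imports>
</imports>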