HyukjinKwon opened a new pull request, #43894:
URL: https://github.com/apache/spark/pull/43894

   ### What changes were proposed in this pull request?
   
   This PR improves the error messages for the dependency requirements of Python Spark Connect.
   
   ### Why are the changes needed?
   
   To improve the error messages. Currently, a missing dependency surfaces as a raw `ModuleNotFoundError`:
   
   ```
   /.../pyspark/shell.py:57: UserWarning: Failed to initialize Spark session.
     warnings.warn("Failed to initialize Spark session.")
   Traceback (most recent call last):
     File "/Users/hyukjin.kwon/workspace/forked/spark/python/pyspark/shell.py", line 52, in <module>
       spark = SparkSession.builder.getOrCreate()
     File "/Users/hyukjin.kwon/workspace/forked/spark/python/pyspark/sql/session.py", line 476, in getOrCreate
       from pyspark.sql.connect.session import SparkSession as RemoteSparkSession
     File "/Users/hyukjin.kwon/workspace/forked/spark/python/pyspark/sql/connect/session.py", line 53, in <module>
       from pyspark.sql.connect.client import SparkConnectClient, ChannelBuilder
     File "/Users/hyukjin.kwon/workspace/forked/spark/python/pyspark/sql/connect/client/__init__.py", line 22, in <module>
       from pyspark.sql.connect.client.core import *  # noqa: F401,F403
     File "/Users/hyukjin.kwon/workspace/forked/spark/python/pyspark/sql/connect/client/core.py", line 51, in <module>
       import google.protobuf.message
   ModuleNotFoundError: No module named 'google'
   ```
   
   ```
   /.../pyspark/shell.py:57: UserWarning: Failed to initialize Spark session.
     warnings.warn("Failed to initialize Spark session.")
   Traceback (most recent call last):
     File "/Users/hyukjin.kwon/workspace/forked/spark/python/pyspark/shell.py", line 52, in <module>
       spark = SparkSession.builder.getOrCreate()
     File "/Users/hyukjin.kwon/workspace/forked/spark/python/pyspark/sql/session.py", line 476, in getOrCreate
       from pyspark.sql.connect.session import SparkSession as RemoteSparkSession
     File "/Users/hyukjin.kwon/workspace/forked/spark/python/pyspark/sql/connect/session.py", line 53, in <module>
       from pyspark.sql.connect.client import SparkConnectClient, ChannelBuilder
     File "/Users/hyukjin.kwon/workspace/forked/spark/python/pyspark/sql/connect/client/__init__.py", line 22, in <module>
       from pyspark.sql.connect.client.core import *  # noqa: F401,F403
     File "/Users/hyukjin.kwon/workspace/forked/spark/python/pyspark/sql/connect/client/core.py", line 52, in <module>
       from grpc_status import rpc_status
   ModuleNotFoundError: No module named 'grpc_status'
   ```
   
   ### Does this PR introduce _any_ user-facing change?
   
   Yes, it changes the user-facing error messages.
   
   ### How was this patch tested?
   
   Manually tested as below:
   
   ```bash
   ➜  spark git:(master) ✗ conda create -y -n python3.10 python=3.10
   ...
   ➜  spark git:(master) ✗ conda activate python3.10
   (python3.10) ➜  spark git:(master) ✗ ./bin/pyspark --remote local
   ...
       raise ImportError(
   ImportError: Pandas >= 1.4.4 must be installed; however, it was not found.
   (python3.10) ➜  spark git:(master) ✗ pip install 'pandas >= 1.4.4'
   ...
   (python3.10) ➜  spark git:(SPARK-45996) ✗ ./bin/pyspark --remote local
   ...
       raise ImportError(
   ImportError: PyArrow >= 4.0.0 must be installed; however, it was not found.
   (python3.10) ➜  spark git:(SPARK-45996) pip install 'PyArrow >= 4.0.0'
   ...
   (python3.10) ➜  spark git:(SPARK-45996) ./bin/pyspark --remote local
   ...
       raise ImportError(
   ImportError: grpcio >= 1.48.1 must be installed; however, it was not found.
   (python3.10) ➜  spark git:(SPARK-45996) pip install 'grpcio >= 1.48.1'
   ...
   (python3.10) ➜  spark git:(SPARK-45996) ./bin/pyspark --remote local
   ...
       raise ImportError(
   ImportError: grpc-status >= 1.48.1 must be installed; however, it was not found.
   (python3.10) ➜  spark git:(SPARK-45996) ✗ pip install 'grpcio-status >= 1.48.1'
   ...
   (python3.10) ➜  spark git:(SPARK-45996) ✗ ./bin/pyspark --remote local
   ...
   Welcome to
         ____              __
        / __/__  ___ _____/ /__
       _\ \/ _ \/ _ `/ __/  '_/
      /__ / .__/\_,_/_/ /_/\_\   version 4.0.0.dev0
         /_/
   
   Using Python version 3.10.13 (main, Sep 11 2023 08:39:02)
   Client connected to the Spark Connect server at localhost
   SparkSession available as 'spark'.
   >>> spark.range(10).show()
   +---+
   | id|
   +---+
   |  0|
   ...
   ```
   
   Note that `grpcio-status` declares the common `googleapis-common-protos` as a dependency (see https://github.com/grpc/grpc/blob/master/src/python/grpcio_status/setup.py#L67-L69), so it did not need to be installed explicitly.
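
As a hedged aside, one way to confirm that kind of transitive requirement locally is to read the metadata of the installed distribution with the standard library. The helper names `declared_requirements` and `declares_dependency` are hypothetical, and the result depends on what is installed in the current environment.

```python
import re
from importlib import metadata


def declared_requirements(distribution_name):
    """Return the requirement strings a distribution declares, or [] if unknown."""
    try:
        requirements = metadata.requires(distribution_name)
    except metadata.PackageNotFoundError:
        # The distribution is not installed in this environment.
        return []
    return list(requirements or [])


def declares_dependency(distribution_name, dependency_name):
    """Check whether dependency_name appears among the declared requirements."""
    wanted = dependency_name.lower().replace("_", "-")
    for requirement in declared_requirements(distribution_name):
        # Requirement strings look like "name>=1.0; extra == 'x'";
        # compare only the leading project name.
        name = re.split(r"[\s<>=!~;\[(]", requirement, maxsplit=1)[0]
        if name.lower().replace("_", "-") == wanted:
            return True
    return False
```

With `grpcio-status` installed, `declares_dependency("grpcio-status", "googleapis-common-protos")` would be the call to try; it returns `False` when the distribution is absent.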
   
   ### Was this patch authored or co-authored using generative AI tooling?
   
   No.
   

