This is an automated email from the ASF dual-hosted git repository.
gurwls223 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git
The following commit(s) were added to refs/heads/master by this push:
new 8e83ab7de6b [SPARK-42266][PYTHON] Remove the parent directory in
shell.py execution when IPython is used
8e83ab7de6b is described below
commit 8e83ab7de6b362df37741ba2ec944d53de95c51c
Author: Hyukjin Kwon <[email protected]>
AuthorDate: Wed Mar 8 19:38:49 2023 +0900
[SPARK-42266][PYTHON] Remove the parent directory in shell.py execution
when IPython is used
### What changes were proposed in this pull request?
This PR proposes to remove the parent directory in `shell.py` execution
when IPython is used.
This is a general issue for the PySpark shell specifically with IPython:
IPython temporarily adds the parent directory of the script to the Python
path (`sys.path`), which results in packages being searched for under the
`pyspark` directory. For example, `import pandas` attempts to import
`pyspark.pandas`. So far, we haven't had such a conflicting import within
PySpark's own code paths, but Spark Connect now hits this case via its
dependency check (which attempts to import pandas), which exposes the
actual problem.
Running it with IPython can easily reproduce the error:
```bash
PYSPARK_PYTHON=ipython bin/pyspark --remote "local[*]"
```
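The shadowing can also be reproduced without IPython at all. The sketch below simulates what IPython effectively does: it prepends a directory containing a module whose name collides with a real package (here an invented `json.py` in a temporary directory stands in for `pyspark/pandas` shadowing `pandas`):

```python
import os
import sys
import tempfile

# Create a directory with a module named like a real (stdlib) package.
workdir = tempfile.mkdtemp()
with open(os.path.join(workdir, "json.py"), "w") as f:
    f.write("SHADOWED = True\n")

sys.path.insert(0, workdir)    # what IPython does with the script's parent dir
sys.modules.pop("json", None)  # force a fresh import resolution

import json  # resolves to workdir/json.py, not the stdlib module

print(getattr(json, "SHADOWED", False))  # → True
```

The local file wins because `sys.path` is searched in order and the shadow directory sits at the front, which is exactly the position IPython uses.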
### Why are the changes needed?
To make the PySpark shell properly import other packages even when their
names conflict with `pyspark` subpackages (e.g., `pyspark.pandas` vs `pandas`).
### Does this PR introduce _any_ user-facing change?
No, not to end users:
- Because this path is only inserted for `shell.py` execution, and
thankfully we haven't had such a relative import case so far.
- It fixes the issue in the unreleased Spark Connect.
### How was this patch tested?
Manually tested.
```bash
PYSPARK_PYTHON=ipython bin/pyspark --remote "local[*]"
```
**Before:**
```
Python 3.9.16 | packaged by conda-forge | (main, Feb 1 2023, 21:42:20)
Type 'copyright', 'credits' or 'license' for more information
IPython 8.10.0 -- An enhanced Interactive Python. Type '?' for help.
/.../spark/python/pyspark/shell.py:45: UserWarning: Failed to initialize Spark session.
  warnings.warn("Failed to initialize Spark session.")
Traceback (most recent call last):
  File "/.../spark/python/pyspark/shell.py", line 40, in <module>
    spark = SparkSession.builder.getOrCreate()
  File "/.../spark/python/pyspark/sql/session.py", line 437, in getOrCreate
    from pyspark.sql.connect.session import SparkSession as RemoteSparkSession
  File "/.../spark/python/pyspark/sql/connect/session.py", line 19, in <module>
    check_dependencies(__name__, __file__)
  File "/.../spark/python/pyspark/sql/connect/utils.py", line 33, in check_dependencies
    require_minimum_pandas_version()
  File "/.../spark/python/pyspark/sql/pandas/utils.py", line 27, in require_minimum_pandas_version
    import pandas
  File "/.../spark/python/pyspark/pandas/__init__.py", line 29, in <module>
    from pyspark.pandas.missing.general_functions import MissingPandasLikeGeneralFunctions
  File "/.../spark/python/pyspark/pandas/__init__.py", line 34, in <module>
    require_minimum_pandas_version()
  File "/.../spark/python/pyspark/sql/pandas/utils.py", line 37, in require_minimum_pandas_version
    if LooseVersion(pandas.__version__) < LooseVersion(minimum_pandas_version):
AttributeError: partially initialized module 'pandas' has no attribute '__version__' (most likely due to a circular import)
...
```
**After:**
```
Python 3.9.16 | packaged by conda-forge | (main, Feb 1 2023, 21:42:20)
Type 'copyright', 'credits' or 'license' for more information
IPython 8.10.0 -- An enhanced Interactive Python. Type '?' for help.
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
23/03/08 13:30:51 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/ '_/
   /__ / .__/\_,_/_/ /_/\_\   version 3.5.0.dev0
      /_/
Using Python version 3.9.16 (main, Feb 1 2023 21:42:20)
Client connected to the Spark Connect server at localhost
SparkSession available as 'spark'.
In [1]:
```
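As a sanity check of the approach (this is a standalone simulation, not the actual `shell.py` code), the sketch below mimics IPython inserting the offending directory and the fix removing it again, using an invented `json.py` shadow module in a temporary directory:

```python
import os
import sys
import tempfile

# Set up a shadow module, as in the bug scenario.
workdir = tempfile.mkdtemp()
with open(os.path.join(workdir, "json.py"), "w") as f:
    f.write("SHADOWED = True\n")

sys.path.insert(0, workdir)   # what IPython effectively does at startup
if workdir in sys.path:       # the fix: drop the offending sys.path entry
    sys.path.remove(workdir)
sys.modules.pop("json", None) # force a fresh import resolution

import json  # now finds the real stdlib json again

print(hasattr(json, "SHADOWED"))  # → False
print(json.dumps({"ok": True}))   # → {"ok": true}
```

With the shadow directory removed before any conflicting import happens, resolution falls through to the installed package, which is why the `--remote` shell above starts cleanly.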
Closes #40327 from HyukjinKwon/SPARK-42266.
Authored-by: Hyukjin Kwon <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
---
python/pyspark/shell.py | 12 ++++++++++++
1 file changed, 12 insertions(+)
diff --git a/python/pyspark/shell.py b/python/pyspark/shell.py
index 8613d2d09ea..86f576a3029 100644
--- a/python/pyspark/shell.py
+++ b/python/pyspark/shell.py
@@ -22,9 +22,11 @@ This file is designed to be launched as a PYTHONSTARTUP script.
 """
 import atexit
+import builtins
 import os
 import platform
 import warnings
+import sys

 import pyspark
 from pyspark.context import SparkContext
@@ -33,6 +35,16 @@ from pyspark.sql.context import SQLContext
 from pyspark.sql.utils import is_remote
 from urllib.parse import urlparse

+if getattr(builtins, "__IPYTHON__", False):
+    # (Only) during PYTHONSTARTUP execution, IPython temporarily adds the parent
+    # directory of the script into the Python path, which results in searching
+    # packages under `pyspark` directory.
+    # For example, `import pandas` attempts to import `pyspark.pandas`, see also SPARK-42266.
+    if "__file__" in globals():
+        parent_dir = os.path.abspath(os.path.dirname(__file__))
+        if parent_dir in sys.path:
+            sys.path.remove(parent_dir)
+
 if is_remote():
     try: