This is an automated email from the ASF dual-hosted git repository.
gurwls223 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git
The following commit(s) were added to refs/heads/master by this push:
new 8e83ab7de6b [SPARK-42266][PYTHON] Remove the parent directory in
shell.py execution when IPython is used
8e83ab7de6b is described below
commit 8e83ab7de6b362df37741ba2ec944d53de95c51c
Author: Hyukjin Kwon <[email protected]>
AuthorDate: Wed Mar 8 19:38:49 2023 +0900
[SPARK-42266][PYTHON] Remove the parent directory in shell.py execution
when IPython is used
### What changes were proposed in this pull request?
This PR proposes to remove the parent directory in `shell.py` execution
when IPython is used.
This is a general issue for the PySpark shell specifically with IPython:
IPython temporarily adds the parent directory of the script to the Python
path (`sys.path`), which results in packages being searched for under the
`pyspark` directory. For example, `import pandas` attempts to import
`pyspark.pandas`. So far, we haven't had such a conflicting import within
PySpark's own code paths, but Spark Connect now hits this case via its
dependency check (which attempts to import pandas), which exposes the
actual problem.
Running it with IPython can easily reproduce the error:
```bash
PYSPARK_PYTHON=ipython bin/pyspark --remote "local[*]"
```
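The shadowing can also be reproduced without IPython at all. The sketch below simulates what IPython effectively does: it prepends a directory containing a module whose name collides with a real package (here an invented `json.py` in a temporary directory stands in for `pyspark/pandas` shadowing `pandas`):

```python
import os
import sys
import tempfile

# Create a directory with a module named like a real (stdlib) package.
workdir = tempfile.mkdtemp()
with open(os.path.join(workdir, "json.py"), "w") as f:
    f.write("SHADOWED = True\n")

sys.path.insert(0, workdir)    # what IPython does with the script's parent dir
sys.modules.pop("json", None)  # force a fresh import resolution

import json  # resolves to workdir/json.py, not the stdlib module

print(getattr(json, "SHADOWED", False))  # → True
```

The local file wins because `sys.path` is searched in order and the shadow directory sits at the front, which is exactly the position IPython uses.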
### Why are the changes needed?
To make the PySpark shell properly import other packages even when their
names conflict with `pyspark` subpackages (e.g., `pyspark.pandas` vs `pandas`).
### Does this PR introduce _any_ user-facing change?
No, not to end users:
- Because this path is only inserted for `shell.py` execution, and
thankfully we haven't had such a relative import case so far.
- It fixes the issue in the unreleased Spark Connect.
### How was this patch tested?
Manually tested.
```bash
PYSPARK_PYTHON=ipython bin/pyspark --remote "local[*]"
```
**Before:**
```
Python 3.9.16 | packaged by conda-forge | (main, Feb 1 2023, 21:42:20)
Type 'copyright', 'credits' or 'license' for more information
IPython 8.10.0 -- An enhanced Interactive Python. Type '?' for help.
/.../spark/python/pyspark/shell.py:45: UserWarning: Failed to initialize Spark session.
  warnings.warn("Failed to initialize Spark session.")
Traceback (most recent call last):
  File "/.../spark/python/pyspark/shell.py", line 40, in <module>
    spark = SparkSession.builder.getOrCreate()
  File "/.../spark/python/pyspark/sql/session.py", line 437, in getOrCreate
    from pyspark.sql.connect.session import SparkSession as RemoteSparkSession
  File "/.../spark/python/pyspark/sql/connect/session.py", line 19, in <module>
    check_dependencies(__name__, __file__)
  File "/.../spark/python/pyspark/sql/connect/utils.py", line 33, in check_dependencies
    require_minimum_pandas_version()
  File "/.../spark/python/pyspark/sql/pandas/utils.py", line 27, in require_minimum_pandas_version
    import pandas
  File "/.../spark/python/pyspark/pandas/__init__.py", line 29, in <module>
    from pyspark.pandas.missing.general_functions import MissingPandasLikeGeneralFunctions
  File "/.../spark/python/pyspark/pandas/__init__.py", line 34, in <module>
    require_minimum_pandas_version()
  File "/.../spark/python/pyspark/sql/pandas/utils.py", line 37, in require_minimum_pandas_version
    if LooseVersion(pandas.__version__) < LooseVersion(minimum_pandas_version):
AttributeError: partially initialized module 'pandas' has no attribute '__version__' (most likely due to a circular import)
...
```
**After:**
```
Python 3.9.16 | packaged by conda-forge | (main, Feb 1 2023, 21:42:20)
Type 'copyright', 'credits' or 'license' for more information
IPython 8.10.0 -- An enhanced Interactive Python. Type '?' for help.
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
23/03/08 13:30:51 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/ '_/
   /__ / .__/\_,_/_/ /_/\_\   version 3.5.0.dev0
      /_/
Using Python version 3.9.16 (main, Feb 1 2023 21:42:20)
Client connected to the Spark Connect server at localhost
SparkSession available as 'spark'.
In [1]:
```
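As a sanity check of the approach (this is a standalone simulation, not the actual `shell.py` code), the sketch below mimics IPython inserting the offending directory and the fix removing it again, using an invented `json.py` shadow module in a temporary directory:

```python
import os
import sys
import tempfile

# Set up a shadow module, as in the bug scenario.
workdir = tempfile.mkdtemp()
with open(os.path.join(workdir, "json.py"), "w") as f:
    f.write("SHADOWED = True\n")

sys.path.insert(0, workdir)   # what IPython effectively does at startup
if workdir in sys.path:       # the fix: drop the offending sys.path entry
    sys.path.remove(workdir)
sys.modules.pop("json", None) # force a fresh import resolution

import json  # now finds the real stdlib json again

print(hasattr(json, "SHADOWED"))  # → False
print(json.dumps({"ok": True}))   # → {"ok": true}
```

With the shadow directory removed before any conflicting import happens, resolution falls through to the installed package, which is why the `--remote` shell above starts cleanly.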
Closes #40327 from HyukjinKwon/SPARK-42266.
Authored-by: Hyukjin Kwon <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
---
python/pyspark/shell.py | 12 ++++++++++++
1 file changed, 12 insertions(+)
diff --git a/python/pyspark/shell.py b/python/pyspark/shell.py
index 8613d2d09ea..86f576a3029 100644
--- a/python/pyspark/shell.py
+++ b/python/pyspark/shell.py
@@ -22,9 +22,11 @@ This file is designed to be launched as a PYTHONSTARTUP script.
 """
 import atexit
+import builtins
 import os
 import platform
 import warnings
+import sys

 import pyspark
 from pyspark.context import SparkContext
@@ -33,6 +35,16 @@ from pyspark.sql.context import SQLContext
 from pyspark.sql.utils import is_remote
 from urllib.parse import urlparse

+if getattr(builtins, "__IPYTHON__", False):
+    # (Only) during PYTHONSTARTUP execution, IPython temporarily adds the parent
+    # directory of the script into the Python path, which results in searching
+    # packages under `pyspark` directory.
+    # For example, `import pandas` attempts to import `pyspark.pandas`, see also SPARK-42266.
+    if "__file__" in globals():
+        parent_dir = os.path.abspath(os.path.dirname(__file__))
+        if parent_dir in sys.path:
+            sys.path.remove(parent_dir)
+
 if is_remote():
     try: