Raj Raj created SPARK-35336:
-------------------------------
Summary: Pyspark - Using importlib + filter + named function +
take causes pyspark to restart continuously until machine runs out of memory
Key: SPARK-35336
URL: https://issues.apache.org/jira/browse/SPARK-35336
Project: Spark
Issue Type: Bug
Components: PySpark
Affects Versions: 3.1.1, 3.0.0
Reporter: Raj Raj
Repo to reproduce issue
[https://github.com/CanaryWharf/pyspark-mem-importlib-bug-reproduction]
Expected behaviour:
Program runs and exits cleanly
Actual behaviour:
Program runs forever, eating up all the memory on the machine
Steps to reproduce:
```
pip install -r requirements.txt
python run.py
```
The problem only occurs when the code is run via `importlib`; it does not
occur when `sparky.py` is run directly.
The problem also occurs if you replace `filter` with `map` or `flatMap`
(anything that takes a function argument).
The problem only occurs if you pass a named function (i.e., one defined with
`def`).
So these break:
```
def func(x):
    return True

dataset.filter(func)
```
```
def func(x):
    return True

dataset.filter(lambda x: func(x))
```
The problem does *NOT* occur if you do this:
```
dataset.filter(lambda x: True)
```
```
dataset.filter(lambda x: x == 'stuff')
```
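For context, the distinguishing factor seems to be how the module containing the named function is loaded. Below is a minimal, hedged sketch of the `importlib`-based loading path (without Spark itself) that the repro repo presumably uses; the module name `sparky` comes from this report, and the exact loader calls are an assumption. When a function is defined in a module loaded this way, it is pickled by reference to that module name rather than by value, which may explain why workers behave differently than when `sparky.py` is run directly.

```python
# Hedged sketch: loading a module dynamically via importlib, the way the
# repro repo presumably loads sparky.py. Spark is deliberately omitted;
# this only demonstrates the import path that triggers the bug.
import importlib.util
import os
import tempfile

# Stand-in for sparky.py: a module defining a named function that would
# later be passed to dataset.filter(func).
source = (
    "def func(x):\n"
    "    return True\n"
)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sparky.py")
    with open(path, "w") as f:
        f.write(source)

    # Load the file as module "sparky" via importlib (assumed loader calls).
    spec = importlib.util.spec_from_file_location("sparky", path)
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)

    # The named function is attributed to module "sparky", not "__main__";
    # serializing it therefore references that module name, which Spark
    # workers may not be able to import.
    print(module.func.__module__)  # "sparky"
```

Note that an inline `lambda` has no such module-level name to reference, which is consistent with the lambda-only variants above not triggering the bug.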
--
This message was sent by Atlassian Jira
(v8.3.4#803005)