Raj Raj created SPARK-35336:
-------------------------------

             Summary: Pyspark - Using importlib + filter + named function + 
take causes pyspark to restart continuously until machine runs out of memory
                 Key: SPARK-35336
                 URL: https://issues.apache.org/jira/browse/SPARK-35336
             Project: Spark
          Issue Type: Bug
          Components: PySpark
    Affects Versions: 3.1.1, 3.0.0
            Reporter: Raj Raj


Repository to reproduce the issue:

[https://github.com/CanaryWharf/pyspark-mem-importlib-bug-reproduction]

 

Expected behaviour:

Program runs and exits cleanly

 

Actual behaviour:

Program runs forever, eating up all the memory on the machine

 

Steps to reproduce:

```
pip install -r requirements.txt
python run.py
```

The problem only occurs when the code is run via `importlib`; it does not occur when `sparky.py` is run directly.
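For context, a minimal sketch of the dynamic-import pattern that `run.py` presumably uses to load `sparky.py` (the loading code and names below are assumptions, not taken from the repro; a throwaway stand-in module replaces `sparky.py` so the sketch runs on its own):

```python
import importlib.util
import pathlib
import tempfile

# Hypothetical sketch of the dynamic-import path (names are assumptions).
# A stand-in module takes the place of sparky.py so this is self-contained.
with tempfile.TemporaryDirectory() as tmp:
    path = pathlib.Path(tmp) / "sparky.py"
    path.write_text("def job():\n    return 'ran'\n")

    # Load the file as a module without a normal `import` statement,
    # as run.py in the linked repro presumably does.
    spec = importlib.util.spec_from_file_location("sparky", path)
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)  # executes the module's top level

    result = module.job()
```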

Furthermore, the problem also occurs if you replace `filter` with `map` or `flatMap` (anything that takes a function argument).

The problem only occurs if you pass a named function (i.e., one defined with `def func`).

So these break:

```
def func(x):
    return True

dataset.filter(func)
```

 

```
def func(x):
    return True

dataset.filter(lambda x: func(x))
```

 

The problem does *NOT* occur if you do this:

```
dataset.filter(lambda x: True)
```

```
dataset.filter(lambda x: x == 'stuff')
```
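One possible contributing mechanism, offered only as an unconfirmed hypothesis: the stdlib pickler serialises a named function by reference (module name plus qualified name), so a `def` from a dynamically loaded module that was never registered in `sys.modules` cannot be round-tripped, whereas lambdas are serialised by value (PySpark ships closures to workers with cloudpickle). The sketch below demonstrates only the reference-pickling failure in plain Python, not the Spark behaviour itself:

```python
import importlib.util
import pathlib
import pickle
import tempfile

# Hedged illustration (an assumption, not a confirmed diagnosis of the bug):
# a named function from a module loaded via importlib machinery, absent from
# sys.modules and sys.path, cannot be pickled by reference.
with tempfile.TemporaryDirectory() as tmp:
    path = pathlib.Path(tmp) / "dynmod.py"
    path.write_text("def func(x):\n    return True\n")

    spec = importlib.util.spec_from_file_location("dynmod", path)
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)  # note: never added to sys.modules

    try:
        # pickle records "dynmod.func" and needs to re-import dynmod later,
        # which fails because dynmod is not importable here.
        pickle.dumps(module.func)
        pickled_ok = True
    except pickle.PicklingError:
        pickled_ok = False
```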



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
