[
https://issues.apache.org/jira/browse/SPARK-30473?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Max Härtwig updated SPARK-30473:
--------------------------------
Description:
PySpark enum subclass crashes when used inside a UDF.
Example:
{code:java}
from enum import Enum

class Direction(Enum):
    NORTH = 0
    SOUTH = 1
{code}
Working:
{code:java}
Direction.NORTH{code}
Crashing:
{code:java}
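from pyspark.sql.functions import udf

@udf
def fn(a):
    Direction.NORTH  # referencing the enum is enough to trigger the crash
    return ""

# df is any DataFrame with a column "a"
df.withColumn("test", fn("a")){code}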
Stacktrace:
{noformat}
SparkException: Job aborted due to stage failure: Task 0 in stage 9.0 failed 4
times, most recent failure: Lost task 0.3 in stage 9.0 (TID 235, 10.139.64.21,
executor 0): org.apache.spark.api.python.PythonException: Traceback (most
recent call last):
  File "/databricks/spark/python/pyspark/serializers.py", line 182, in _read_with_length
    return self.loads(obj)
  File "/databricks/spark/python/pyspark/serializers.py", line 695, in loads
    return pickle.loads(obj, encoding=encoding)
  File "/databricks/python/lib/python3.7/enum.py", line 152, in __new__
    enum_members = {k: classdict[k] for k in classdict._member_names}
AttributeError: 'dict' object has no attribute '_member_names'{noformat}
I suspect the problem is in *python/pyspark/cloudpickle.py*. On line 586, in the
function *_save_dynamic_enum*, the attribute *_member_names* is removed from
the enum. Yet the *Enum* class requires this attribute when the enum is
unpickled (see the stack trace), so every Enum subclass used in a UDF crashes.
was:
PySpark enum subclass crashes when used inside a UDF.
Example:
{code:java}
from enum import Enum

class Direction(Enum):
    NORTH = 0
    SOUTH = 1
{code}
Working:
{code:java}
Direction.NORTH{code}
Crashing:
{code:java}
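from pyspark.sql.functions import udf

@udf
def fn(a):
    Direction.NORTH  # referencing the enum is enough to trigger the crash
    return ""

# df is any DataFrame with a column "a"
df.withColumn("test", fn("a")){code}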
Stacktrace:
{noformat}
SparkException: Job aborted due to stage failure: Task 0 in stage 9.0 failed 4
times, most recent failure: Lost task 0.3 in stage 9.0 (TID 235, 10.139.64.21,
executor 0): org.apache.spark.api.python.PythonException: Traceback (most
recent call last):
  File "/databricks/spark/python/pyspark/serializers.py", line 182, in _read_with_length
    return self.loads(obj)
  File "/databricks/spark/python/pyspark/serializers.py", line 695, in loads
    return pickle.loads(obj, encoding=encoding)
  File "/databricks/python/lib/python3.7/enum.py", line 152, in __new__
    enum_members = {k: classdict[k] for k in classdict._member_names}
AttributeError: 'dict' object has no attribute '_member_names'{noformat}
I suspect the problem is in `python/pyspark/cloudpickle.py`. On line 586 in the
function `_save_dynamic_enum`, the attribute `_member_names` is removed from
the enum. Yet, this attribute is required by the `Enum` class and Enum
subclasses will crash.
> PySpark enum subclass crashes when used inside UDF
> --------------------------------------------------
>
> Key: SPARK-30473
> URL: https://issues.apache.org/jira/browse/SPARK-30473
> Project: Spark
> Issue Type: Bug
> Components: PySpark
> Affects Versions: 2.4.4
> Environment: Databricks Runtime 6.2 (includes Apache Spark 2.4.4,
> Scala 2.11)
> Reporter: Max Härtwig
> Priority: Major
>
> PySpark enum subclass crashes when used inside a UDF.
>
> Example:
> {code:java}
> from enum import Enum
>
> class Direction(Enum):
>     NORTH = 0
>     SOUTH = 1
> {code}
>
> Working:
> {code:java}
> Direction.NORTH{code}
>
> Crashing:
> {code:java}
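> from pyspark.sql.functions import udf
>
> @udf
> def fn(a):
>     Direction.NORTH  # referencing the enum is enough to trigger the crash
>     return ""
>
> # df is any DataFrame with a column "a"
> df.withColumn("test", fn("a")){code}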
>
> Stacktrace:
> {noformat}
> SparkException: Job aborted due to stage failure: Task 0 in stage 9.0 failed
> 4 times, most recent failure: Lost task 0.3 in stage 9.0 (TID 235,
> 10.139.64.21, executor 0): org.apache.spark.api.python.PythonException:
> Traceback (most recent call last):
>   File "/databricks/spark/python/pyspark/serializers.py", line 182, in _read_with_length
>     return self.loads(obj)
>   File "/databricks/spark/python/pyspark/serializers.py", line 695, in loads
>     return pickle.loads(obj, encoding=encoding)
>   File "/databricks/python/lib/python3.7/enum.py", line 152, in __new__
>     enum_members = {k: classdict[k] for k in classdict._member_names}
> AttributeError: 'dict' object has no attribute '_member_names'{noformat}
>
> I suspect the problem is in *python/pyspark/cloudpickle.py*. On line 586, in
> the function *_save_dynamic_enum*, the attribute *_member_names* is removed
> from the enum. Yet the *Enum* class requires this attribute when the enum is
> unpickled (see the stack trace), so every Enum subclass used in a UDF crashes.
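
The stack trace suggests that, during unpickling, the Enum metaclass is called
with a plain dict as the class namespace. The following is a minimal sketch of
that failure mode in plain Python, without Spark or cloudpickle. It only
illustrates where the error message can come from and is not a confirmed
account of what cloudpickle.py does. The helper name rebuild_with_plain_dict is
hypothetical and exists only for this illustration.

```python
from enum import Enum


class Direction(Enum):
    NORTH = 0
    SOUTH = 1


def rebuild_with_plain_dict():
    """Try to recreate an Enum class from a plain dict namespace.

    A normal class statement hands the metaclass a special enum namespace
    object. Here we pass an ordinary dict instead, which is what the
    AttributeError in the report complains about.
    Returns the error message, or None if no AttributeError was raised.
    """
    enum_metaclass = type(Direction)  # the metaclass behind every Enum
    plain_namespace = {"NORTH": 0, "SOUTH": 1}
    try:
        enum_metaclass("Direction", (Enum,), plain_namespace)
    except AttributeError as exc:
        return str(exc)
    return None


if __name__ == "__main__":
    print(rebuild_with_plain_dict())
```

If the suspicion in the report is right, the message returned here should
mention the same missing _member_names attribute as the stack trace, while the
ordinary class statement for Direction keeps working.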