[
https://issues.apache.org/jira/browse/SPARK-22674?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17410280#comment-17410280
]
Hyukjin Kwon commented on SPARK-22674:
--------------------------------------
I plan to fix it in Spark 3.3, FWIW.
> PySpark breaks serialization of namedtuple subclasses
> -----------------------------------------------------
>
> Key: SPARK-22674
> URL: https://issues.apache.org/jira/browse/SPARK-22674
> Project: Spark
> Issue Type: Bug
> Components: PySpark
> Affects Versions: 2.2.0, 2.3.0, 3.1.1
> Reporter: Jonas Amrich
> Priority: Major
>
> PySpark monkey-patches the namedtuple class to make it serializable; however,
> this breaks serialization of its subclasses. With the current implementation,
> any subclass will be serialized (and deserialized) as its parent namedtuple.
> Consider this code, which will fail with {{AttributeError: 'Point' object has
> no attribute 'sum'}}:
> {code}
> from collections import namedtuple
>
> Point = namedtuple("Point", "x y")
>
> class PointSubclass(Point):
>     def sum(self):
>         return self.x + self.y
>
> rdd = spark.sparkContext.parallelize([[PointSubclass(1, 1)]])
> rdd.collect()[0][0].sum()
> {code}
> Moreover, as PySpark hijacks all namedtuples in the main module, importing
> pyspark breaks serialization of namedtuple subclasses even in code that is
> unrelated to Spark or distributed execution. I don't see any clean solution
> to this; a possible workaround may be to limit the serialization hack to
> direct namedtuple instances only, as in
> https://github.com/JonasAmrich/spark/commit/f3efecee28243380ecf6657fe54e1a165c1b7204
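Until a fix lands, one possible user-side workaround (a sketch, not part of the report above, and not verified against every affected Spark version) is to override {{__reduce__}} on the subclass, so that pickling reconstructs the subclass from its field values instead of inheriting the patched parent namedtuple's reduce logic:

```python
import pickle
from collections import namedtuple

Point = namedtuple("Point", "x y")

class PointSubclass(Point):
    def sum(self):
        return self.x + self.y

    # Restore standard reconstruction: rebuild this subclass from its
    # field values, rather than inheriting the __reduce__ that PySpark's
    # monkey-patched namedtuple injects (which would deserialize the
    # object as the parent Point).
    def __reduce__(self):
        return (self.__class__, tuple(self))

# Round-tripping through pickle now preserves the subclass and its methods.
p = pickle.loads(pickle.dumps(PointSubclass(1, 1)))
```

This only helps for subclasses the user controls; it does not address the general problem of the hijack affecting unrelated namedtuple code.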
--
This message was sent by Atlassian Jira
(v8.3.4#803005)