[
https://issues.apache.org/jira/browse/SPARK-22674?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17410280#comment-17410280
]
Hyukjin Kwon commented on SPARK-22674:
--------------------------------------
I plan to fix it in Spark 3.3, FWIW.
> PySpark breaks serialization of namedtuple subclasses
> -----------------------------------------------------
>
> Key: SPARK-22674
> URL: https://issues.apache.org/jira/browse/SPARK-22674
> Project: Spark
> Issue Type: Bug
> Components: PySpark
> Affects Versions: 2.2.0, 2.3.0, 3.1.1
> Reporter: Jonas Amrich
> Priority: Major
>
> PySpark monkey-patches the namedtuple class to make it serializable; however,
> this breaks serialization of its subclasses. With the current implementation,
> any subclass will be serialized (and deserialized) as its parent namedtuple.
> Consider this code, which will fail with {{AttributeError: 'Point' object has
> no attribute 'sum'}}:
> {code}
> from collections import namedtuple
>
> Point = namedtuple("Point", "x y")
>
> class PointSubclass(Point):
>     def sum(self):
>         return self.x + self.y
>
> rdd = spark.sparkContext.parallelize([[PointSubclass(1, 1)]])
> rdd.collect()[0][0].sum()
> {code}
> Moreover, as PySpark hijacks all namedtuples in the main module, importing
> pyspark breaks serialization of namedtuple subclasses even in code that is
> unrelated to Spark or distributed execution. I don't see any clean solution
> to this; a possible workaround may be to limit the serialization hack to
> direct namedtuple instances only, as in
> https://github.com/JonasAmrich/spark/commit/f3efecee28243380ecf6657fe54e1a165c1b7204
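Until a fix lands, one possible user-side workaround (a sketch, not part of the report above, and not verified against every affected Spark version) is to override {{__reduce__}} on the subclass, so that pickling reconstructs the subclass from its field values instead of inheriting the patched parent namedtuple's reduce logic:

```python
import pickle
from collections import namedtuple

Point = namedtuple("Point", "x y")

class PointSubclass(Point):
    def sum(self):
        return self.x + self.y

    # Restore standard reconstruction: rebuild this subclass from its
    # field values, rather than inheriting the __reduce__ that PySpark's
    # monkey-patched namedtuple injects (which would deserialize the
    # object as the parent Point).
    def __reduce__(self):
        return (self.__class__, tuple(self))

# Round-tripping through pickle now preserves the subclass and its methods.
p = pickle.loads(pickle.dumps(PointSubclass(1, 1)))
```

This only helps for subclasses the user controls; it does not address the general problem of the hijack affecting unrelated namedtuple code.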
--
This message was sent by Atlassian Jira
(v8.3.4#803005)