[ https://issues.apache.org/jira/browse/SPARK-22674?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16276390#comment-16276390 ]

Hyukjin Kwon commented on SPARK-22674:
--------------------------------------

Not really. Regular pickle can pickle it, but it cannot unpickle it in a fresh 
process unless the class is explicitly declared there:


{code}
import pickle
from collections import namedtuple

Point = namedtuple("Point", "x y")

class PointSubclass(Point):
    def sum(self):
        return self.x + self.y

# Dumping works: the stream only references the class by module and name.
with open("foo", "wb") as w:
    w.write(pickle.dumps(PointSubclass(1, 2)))
exit(0)
{code}

Open {{python}} again:

{code}
import pickle
pickle.loads(open("foo", "rb").read())

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/.../2.7/lib/python2.7/pickle.py", line 1382, in loads
    return Unpickler(file).load()
  File "/.../2.7/lib/python2.7/pickle.py", line 858, in load
    dispatch[key](self)
  File "/.../2.7/lib/python2.7/pickle.py", line 1090, in load_global
    klass = self.find_class(module, name)
  File "/.../2.7/lib/python2.7/pickle.py", line 1126, in find_class
    klass = getattr(mod, name)
AttributeError: 'module' object has no attribute 'PointSubclass'
{code}
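
The failure is inherent to how pickle encodes classes: the stream stores only a 
module-and-name reference (a GLOBAL opcode), not the class body, so a fresh 
interpreter has nothing named {{PointSubclass}} to look up. A quick way to see 
this (illustrative; run it in the session that still has the class defined):

{code}
import pickle
import pickletools

# The disassembly contains a GLOBAL '__main__ PointSubclass' opcode:
# a reference to the class, not its definition.
pickletools.dis(pickle.dumps(PointSubclass(1, 2)))
{code}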


After the fix in cloudpickle:

{code}
import cloudpickle
from collections import namedtuple

Point = namedtuple("Point", "x y")

class PointSubclass(Point):
    def sum(self):
        return self.x + self.y

# cloudpickle serializes the class definition itself by value,
# so no matching class declaration is needed at load time.
with open("foo", "wb") as w:
    w.write(cloudpickle.dumps(PointSubclass(1, 2)))
exit(0)
{code}

Open {{python}} again:

{code}
import pickle
p = pickle.loads(open("foo", "rb").read())  # succeeds: the class definition travels with the pickle
p.sum()  # returns 3; the subclass and its method survive the round trip
{code}

PySpark's bundled cloudpickle worked around this problem first, in a hijacking 
way (monkey-patching {{namedtuple}}), before upstream cloudpickle had a fix. 
After that, upstream cloudpickle introduced the fix I pointed out.
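
For context, here is a minimal sketch of that hijacking approach, modeled 
loosely on {{_hack_namedtuple}} in {{pyspark/serializers.py}} (names and 
details are simplified here, so treat it as illustrative rather than the exact 
implementation). Each class generated by {{namedtuple}} gets a {{__reduce__}} 
that rebuilds it from its name, fields and values:

{code}
import collections

_cls_cache = {}

def _restore(name, fields, values):
    # Recreate the namedtuple class from its name and fields, then
    # instantiate it. Note this always rebuilds the *base* namedtuple.
    cls = _cls_cache.get((name, fields))
    if cls is None:
        cls = collections.namedtuple(name, fields)
        _cls_cache[(name, fields)] = cls
    return cls(*values)

def _hack_namedtuple(cls):
    # Attach a __reduce__ so plain pickle can serialize instances
    # without the class being importable on the deserializing side.
    name, fields = cls.__name__, cls._fields

    def __reduce__(self):
        return (_restore, (name, fields, tuple(self)))

    cls.__reduce__ = __reduce__
    return cls
{code}

A subclass such as {{PointSubclass}} inherits that {{__reduce__}}, whose 
closure captured the parent's name and fields, so it round-trips as a plain 
{{Point}}; that is exactly the {{AttributeError}} reported in this issue.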


> PySpark breaks serialization of namedtuple subclasses
> -----------------------------------------------------
>
>                 Key: SPARK-22674
>                 URL: https://issues.apache.org/jira/browse/SPARK-22674
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 2.2.0
>            Reporter: Jonas Amrich
>
> PySpark monkey-patches the namedtuple class to make it serializable; however, 
> this breaks serialization of its subclasses. With the current implementation, 
> any subclass will be serialized (and deserialized) as its parent namedtuple. 
> Consider this code, which will fail with {{AttributeError: 'Point' object has 
> no attribute 'sum'}}:
> {code}
> from collections import namedtuple
> Point = namedtuple("Point", "x y")
> class PointSubclass(Point):
>     def sum(self):
>         return self.x + self.y
> rdd = spark.sparkContext.parallelize([[PointSubclass(1, 1)]])
> rdd.collect()[0][0].sum()
> {code}
> Moreover, since PySpark hijacks all namedtuples in the main module, importing 
> pyspark breaks serialization of namedtuple subclasses even in code that is 
> not related to Spark or distributed execution. I don't see any clean solution 
> to this; a possible workaround may be to limit the serialization hack to 
> direct namedtuple subclasses, as in 
> https://github.com/JonasAmrich/spark/commit/f3efecee28243380ecf6657fe54e1a165c1b7204


