[ https://issues.apache.org/jira/browse/SPARK-22674?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16276390#comment-16276390 ]
Hyukjin Kwon commented on SPARK-22674:
--------------------------------------

Not really. Regular pickle can pickle it but is unable to unpickle it without an explicit class declaration:

{code}
import pickle
from collections import namedtuple

Point = namedtuple("Point", "x y")

class PointSubclass(Point):
    def sum(self):
        return self.x + self.y

w = open("foo", "w")
w.write(pickle.dumps(PointSubclass(1, 2)))
w.close()
exit(0)
{code}

Open {{python}} again:

{code}
import pickle
pickle.loads(open("foo").read())
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/.../2.7/lib/python2.7/pickle.py", line 1382, in loads
    return Unpickler(file).load()
  File "/.../2.7/lib/python2.7/pickle.py", line 858, in load
    dispatch[key](self)
  File "/.../2.7/lib/python2.7/pickle.py", line 1090, in load_global
    klass = self.find_class(module, name)
  File "/.../2.7/lib/python2.7/pickle.py", line 1126, in find_class
    klass = getattr(mod, name)
AttributeError: 'module' object has no attribute 'PointSubclass'
{code}

After the fix in cloudpickle:

{code}
import cloudpickle
from collections import namedtuple

Point = namedtuple("Point", "x y")

class PointSubclass(Point):
    def sum(self):
        return self.x + self.y

w = open("foo", "w")
w.write(cloudpickle.dumps(PointSubclass(1, 2)))
w.close()
exit(0)
{code}

Open {{python}} again:

{code}
import pickle
pickle.loads(open("foo").read())
{code}

PySpark's bundled copy of cloudpickle worked around this problem first, by hijacking namedtuple serialization; upstream cloudpickle later introduced the fix I pointed out.

> PySpark breaks serialization of namedtuple subclasses
> -----------------------------------------------------
>
>                 Key: SPARK-22674
>                 URL: https://issues.apache.org/jira/browse/SPARK-22674
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 2.2.0
>            Reporter: Jonas Amrich
>
> PySpark monkey patches the namedtuple class to make it serializable; however,
> this breaks serialization of its subclasses.
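For reference, the same unpickling failure can be reproduced on Python 3 in a single process by deleting the class definition before calling {{pickle.loads}}; this is a minimal sketch simulating a fresh interpreter, not PySpark or cloudpickle code:

```python
import pickle
import sys
from collections import namedtuple

Point = namedtuple("Point", "x y")

class PointSubclass(Point):
    def total(self):
        return self.x + self.y

blob = pickle.dumps(PointSubclass(1, 2))

# Round-trip works while the class is still importable:
p = pickle.loads(blob)
assert p.total() == 3

# Simulate a fresh interpreter: remove the class from its module,
# so pickle's find_class can no longer resolve it by name.
mod = sys.modules[PointSubclass.__module__]
delattr(mod, "PointSubclass")

err = None
try:
    pickle.loads(blob)
except AttributeError as e:
    err = e
print("unpickle failed:", err)
```

This is exactly why cloudpickle serializes such classes by value rather than by reference.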
> With the current implementation, any subclass will be serialized (and
> deserialized) as its parent namedtuple. Consider this code, which will fail
> with {{AttributeError: 'Point' object has no attribute 'sum'}}:
> {code}
> from collections import namedtuple
>
> Point = namedtuple("Point", "x y")
>
> class PointSubclass(Point):
>     def sum(self):
>         return self.x + self.y
>
> rdd = spark.sparkContext.parallelize([[PointSubclass(1, 1)]])
> rdd.collect()[0][0].sum()
> {code}
> Moreover, as PySpark hijacks all namedtuples in the main module, importing
> pyspark breaks serialization of namedtuple subclasses even in code that is
> not related to Spark or distributed execution. I don't see any clean solution
> to this; a possible workaround may be to limit the serialization hack to
> direct namedtuple subclasses, as in
> https://github.com/JonasAmrich/spark/commit/f3efecee28243380ecf6657fe54e1a165c1b7204

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
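The mechanism behind the reported bug can be illustrated without Spark. The sketch below mimics the monkey patch with hypothetical helpers ({{_restore}} and {{_hack_namedtuple}} are stand-ins for illustration, not PySpark's actual code): the patched {{__reduce__}} closes over the parent class, so any subclass instance is reconstructed as the plain parent namedtuple and loses its methods.

```python
import pickle
from collections import namedtuple

# Hypothetical helper mimicking the hijack: rebuild the namedtuple
# class by value from its name and field names.
def _restore(name, fields, values):
    return namedtuple(name, fields)(*values)

# Hypothetical monkey patch: make instances serialize via _restore.
def _hack_namedtuple(cls):
    def __reduce__(self):
        # NOTE: cls is the *patched* class, so subclasses inherit a
        # __reduce__ that reconstructs the parent instead of themselves.
        return (_restore, (cls.__name__, cls._fields, tuple(self)))
    cls.__reduce__ = __reduce__
    return cls

Point = _hack_namedtuple(namedtuple("Point", "x y"))

class PointSubclass(Point):
    def total(self):
        return self.x + self.y

p2 = pickle.loads(pickle.dumps(PointSubclass(1, 2)))
print(type(p2).__name__)     # the subclass came back as a plain "Point"
print(hasattr(p2, "total"))  # the subclass method is gone
```

The linked workaround commit goes in the opposite direction: instead of letting subclasses inherit the hijacked {{__reduce__}}, it restricts the hack so only direct namedtuple products are rewritten.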