[
https://issues.apache.org/jira/browse/SPARK-27810?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Travis Addair updated SPARK-27810:
----------------------------------
Description:
After importing pyspark, cloudpickle can no longer correctly serialize
objects that inherit from {{collections.namedtuple}}: every base class other
than {{tuple}} is dropped, so subsequent {{isinstance}} checks against those
bases fail.
Here's a minimal reproduction of the issue:
{code:python}
import collections

import cloudpickle
import pyspark


class A(object):
    pass


class B(object):
    pass


class C(A, B, collections.namedtuple('C', ['field'])):
    pass


c = C(1)


def print_bases(obj):
    bases = obj.__class__.__bases__
    for base in bases:
        print(base)


print('original objects:')
print_bases(c)

print('\ncloudpickled objects:')
c2 = cloudpickle.loads(cloudpickle.dumps(c))
print_bases(c2)
{code}
This prints:
{code}
original objects:
<class '__main__.A'>
<class '__main__.B'>
<class 'collections.C'>

cloudpickled objects:
<class 'tuple'>
{code}
All of the other base classes are effectively dropped. The issue appears to be
caused by the
[_hijack_namedtuple|https://github.com/apache/spark/blob/master/python/pyspark/serializers.py#L600]
function, which replaces the namedtuple class with another one.
Note that I can work around this issue by setting
{{collections.namedtuple.__hijack = 1}} before importing pyspark, so I am
fairly confident this is the cause.
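The workaround above can be sketched as follows. The {{__hijack}} attribute is an internal marker that pyspark's serializers module checks before patching {{collections.namedtuple}}, so this deliberately relies on an implementation detail:

```python
import collections

# Workaround from this report: pre-set the internal flag that pyspark's
# _hijack_namedtuple looks for, so it skips patching collections.namedtuple.
# This relies on a PySpark implementation detail, and it must run before
# pyspark is imported anywhere in the process.
collections.namedtuple.__hijack = 1

# import pyspark  # safe to import now; namedtuple classes are left untouched
```

Because the flag must be in place before the first {{import pyspark}}, this belongs at the very top of the program's entry point.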
This issue comes up when working with [TensorFlow feature
columns|https://github.com/tensorflow/tensorflow/blob/master/tensorflow/python/feature_column/feature_column.py],
which derive from collections.namedtuple among other classes.
Cloudpickle also
[supports|https://github.com/cloudpipe/cloudpickle/blob/3f4d9da8c567c8e0363880b760b789b40563f5c3/cloudpickle/cloudpickle.py#L900]
collections.namedtuple serialization, but does not appear to need to replace
the class. Perhaps PySpark could do something similar.
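A minimal sketch of what such a non-invasive approach could look like (the function names here are hypothetical illustrations, not cloudpickle's or PySpark's actual internals): at dump time, capture the class name, {{_fields}}, and values; at load time, recreate an equivalent class and instantiate it, with no patching of {{collections.namedtuple}} itself:

```python
import collections


def _rebuild_namedtuple(name, fields, values):
    # Hypothetical reconstructor: recreate an equivalent namedtuple class
    # at load time, then rebuild the instance from its field values.
    cls = collections.namedtuple(name, fields)
    return cls(*values)


def _reduce_namedtuple(obj):
    # Hypothetical __reduce__-style hook applied at dump time to a plain
    # namedtuple instance; collections.namedtuple is never modified.
    cls = type(obj)
    return _rebuild_namedtuple, (cls.__name__, cls._fields, tuple(obj))


Point = collections.namedtuple('Point', ['x', 'y'])
p = Point(1, 2)
func, args = _reduce_namedtuple(p)
restored = func(*args)
```

This sketch only covers plain namedtuple classes; a subclass like {{C}} above would additionally need its extra bases preserved, which is exactly the information PySpark's current hijacking loses.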
> PySpark breaks Cloudpickle serialization of collections.namedtuple objects
> --------------------------------------------------------------------------
>
> Key: SPARK-27810
> URL: https://issues.apache.org/jira/browse/SPARK-27810
> Project: Spark
> Issue Type: Bug
> Components: PySpark
> Affects Versions: 2.4.3
> Reporter: Travis Addair
> Priority: Major
>
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]