[ 
https://issues.apache.org/jira/browse/SPARK-27810?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Travis Addair updated SPARK-27810:
----------------------------------
    Description: 
After importing pyspark, cloudpickle can no longer correctly serialize objects 
that inherit from collections.namedtuple: every base class other than tuple is 
dropped, so subsequent calls to isinstance will fail.

Here's a minimal reproduction of the issue:

{code:python}
import collections
import cloudpickle
import pyspark

class A(object):
    pass

class B(object):
    pass

class C(A, B, collections.namedtuple('C', ['field'])):
    pass

c = C(1)

def print_bases(obj):
    bases = obj.__class__.__bases__
    for base in bases:
        print(base)

print('original objects:')
print_bases(c)

print('\ncloudpickled objects:')
c2 = cloudpickle.loads(cloudpickle.dumps(c))
print_bases(c2)
{code}

This prints:

{code}
original objects:
<class '__main__.A'>
<class '__main__.B'>
<class 'collections.C'>

cloudpickled objects:
<class 'tuple'>
{code}

All base types other than tuple are effectively dropped.  This appears to be 
caused by the 
[_hijack_namedtuple|https://github.com/apache/spark/blob/master/python/pyspark/serializers.py#L600]
 function, which replaces the namedtuple class with a different one.

Note that I can work around this issue by setting 
{{collections.namedtuple.__hijack = 1}} before importing pyspark, so I am 
fairly confident this is the cause.
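To illustrate why the workaround is effective, here is a simplified sketch of the guard at the top of PySpark's {{_hijack_namedtuple}} (the real function in pyspark/serializers.py monkey-patches {{collections.namedtuple}} when the flag is absent; the body below is illustrative, not PySpark's actual patching code):

{code:python}
import collections

# Simplified sketch of PySpark's _hijack_namedtuple guard: if the
# __hijack flag is already set on the namedtuple factory function,
# it returns without replacing it.
def _hijack_namedtuple():
    if hasattr(collections.namedtuple, "__hijack"):
        return  # flag already set: leave namedtuple untouched
    # ... PySpark would replace collections.namedtuple here ...
    collections.namedtuple.__hijack = 1

# The workaround: set the flag *before* importing pyspark, so the
# replacement step above is skipped entirely.
collections.namedtuple.__hijack = 1
_hijack_namedtuple()  # returns immediately

# namedtuple classes now behave normally, multiple inheritance included
Point = collections.namedtuple("Point", ["x", "y"])
p = Point(1, 2)
assert p.x == 1 and isinstance(p, tuple)
{code}

Setting an attribute on {{collections.namedtuple}} works because it is a plain Python function, which accepts arbitrary attributes.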

This issue comes up when working with [TensorFlow feature 
columns|https://github.com/tensorflow/tensorflow/blob/master/tensorflow/python/feature_column/feature_column.py],
 which derive from collections.namedtuple among other classes.

Cloudpickle itself 
[supports|https://github.com/cloudpipe/cloudpickle/blob/3f4d9da8c567c8e0363880b760b789b40563f5c3/cloudpickle/cloudpickle.py#L900]
 collections.namedtuple serialization without needing to replace the class.  
Perhaps PySpark could take a similar approach?
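The gist of that reduce-based approach is to serialize the class's name and fields and rebuild the class on load, rather than patching {{collections.namedtuple}} globally. A minimal sketch (the helper names here are illustrative, not cloudpickle's actual internals, and it handles only a plain namedtuple, not the multiple-inheritance case above):

{code:python}
import collections

# Rebuild a namedtuple class from its metadata on the deserializing side.
def _rebuild_namedtuple(name, fields, values):
    cls = collections.namedtuple(name, fields)
    return cls(*values)

# A __reduce__-style hook: return the rebuild function plus the
# arguments needed to reconstruct the object.
def reduce_namedtuple(obj):
    cls = type(obj)
    return _rebuild_namedtuple, (cls.__name__, cls._fields, tuple(obj))

Point = collections.namedtuple("Point", ["x", "y"])
p = Point(1, 2)

# Round-trip via the reduce tuple, as pickle would:
func, args = reduce_namedtuple(p)
p2 = func(*args)
assert p2 == p and p2.x == 1
{code}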

 

 


> PySpark breaks Cloudpickle serialization of collections.namedtuple objects
> --------------------------------------------------------------------------
>
>                 Key: SPARK-27810
>                 URL: https://issues.apache.org/jira/browse/SPARK-27810
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 2.4.3
>            Reporter: Travis Addair
>            Priority: Major
>



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
