[ https://issues.apache.org/jira/browse/ARROW-5144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16823055#comment-16823055 ]
Wes McKinney commented on ARROW-5144:
-------------------------------------

[~mrocklin] We might be able to do a 0.13.1 release including this and some other fixes in a week or two. If that would be helpful, let us know. FWIW, ParquetDatasetPiece is regarded as an implementation detail and was not necessarily intended to be serializable, so we will keep this in mind for the future.

> [Python] ParquetDataset and ParquetPiece not serializable
> ---------------------------------------------------------
>
>                 Key: ARROW-5144
>                 URL: https://issues.apache.org/jira/browse/ARROW-5144
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: parquet, Python
>    Affects Versions: 0.13.0
>         Environment: osx python36/conda cloudpickle 0.8.1
>                      arrow-cpp 0.13.0 py36ha71616b_0 conda-forge
>                      pyarrow 0.13.0 py36hb37e6aa_0 conda-forge
>            Reporter: Martin Durant
>            Assignee: Krisztian Szucs
>            Priority: Critical
>              Labels: pull-request-available
>             Fix For: 0.14.0
>
>         Attachments: part.0.parquet
>
>          Time Spent: 3h 50m
>  Remaining Estimate: 0h
>
> Since 0.13.0, parquet instances are no longer serialisable, which means that dask.distributed cannot pass them between processes in order to load parquet in parallel.
> Example:
> ```
> >>> import cloudpickle
> >>> import pyarrow.parquet as pq
> >>> pf = pq.ParquetDataset('nation.impala.parquet')
> >>> cloudpickle.dumps(pf)
> ~/anaconda/envs/py36/lib/python3.6/site-packages/cloudpickle/cloudpickle.py in dumps(obj, protocol)
>     893         try:
>     894             cp = CloudPickler(file, protocol=protocol)
> --> 895             cp.dump(obj)
>     896             return file.getvalue()
>     897         finally:
> ~/anaconda/envs/py36/lib/python3.6/site-packages/cloudpickle/cloudpickle.py in dump(self, obj)
>     266         self.inject_addons()
>     267         try:
> --> 268             return Pickler.dump(self, obj)
>     269         except RuntimeError as e:
>     270             if 'recursion' in e.args[0]:
> ~/anaconda/envs/py36/lib/python3.6/pickle.py in dump(self, obj)
>     407         if self.proto >= 4:
>     408             self.framer.start_framing()
> --> 409         self.save(obj)
>     410         self.write(STOP)
>     411         self.framer.end_framing()
> ~/anaconda/envs/py36/lib/python3.6/pickle.py in save(self, obj, save_persistent_id)
>     519
>     520         # Save the reduce() output and finally memoize the object
> --> 521         self.save_reduce(obj=obj, *rv)
>     522
>     523     def persistent_id(self, obj):
> ~/anaconda/envs/py36/lib/python3.6/pickle.py in save_reduce(self, func, args, state, listitems, dictitems, obj)
>     632
>     633         if state is not None:
> --> 634             save(state)
>     635             write(BUILD)
>     636
> ~/anaconda/envs/py36/lib/python3.6/pickle.py in save(self, obj, save_persistent_id)
>     474         f = self.dispatch.get(t)
>     475         if f is not None:
> --> 476             f(self, obj)  # Call unbound method with explicit self
>     477             return
>     478
> ~/anaconda/envs/py36/lib/python3.6/pickle.py in save_dict(self, obj)
>     819
>     820         self.memoize(obj)
> --> 821         self._batch_setitems(obj.items())
>     822
>     823     dispatch[dict] = save_dict
> ~/anaconda/envs/py36/lib/python3.6/pickle.py in _batch_setitems(self, items)
>     845             for k, v in tmp:
>     846                 save(k)
> --> 847                 save(v)
>     848             write(SETITEMS)
>     849         elif n:
> ~/anaconda/envs/py36/lib/python3.6/pickle.py in save(self, obj, save_persistent_id)
>     494         reduce = getattr(obj, "__reduce_ex__", None)
>     495         if reduce is not None:
> --> 496             rv = reduce(self.proto)
>     497         else:
>     498             reduce = getattr(obj, "__reduce__", None)
> ~/anaconda/envs/py36/lib/python3.6/site-packages/pyarrow/_parquet.cpython-36m-darwin.so in pyarrow._parquet.ParquetSchema.__reduce_cython__()
> TypeError: no default __reduce__ due to non-trivial __cinit__
> ```
> The indicated schema instance is also referenced by the ParquetDatasetPiece instances.
> ref: https://github.com/dask/distributed/issues/2597

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
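For context on the traceback above: pickle recurses into the dataset's schema attribute, and `ParquetSchema` is a Cython extension type with a non-trivial `__cinit__` and no `__reduce__`, which is exactly what the final `TypeError` says. Below is a minimal sketch of both sides of the issue, assuming pyarrow 0.13 and a local `nation.impala.parquet` file; `rebuild_dataset` and `PicklablePiece` are hypothetical names for illustration, not Arrow APIs.

```python
import cloudpickle
import pyarrow.parquet as pq

# Caller-side workaround: ship the arguments needed to rebuild the
# dataset instead of the (unpicklable) dataset object itself.
def rebuild_dataset(path):
    # Hypothetical helper: reconstruct the ParquetDataset on the
    # receiving process rather than unpickling one.
    return pq.ParquetDataset(path)

payload = cloudpickle.dumps((rebuild_dataset, ('nation.impala.parquet',)))

func, args = cloudpickle.loads(payload)   # e.g. on a dask worker
table = func(*args).read()                # read the rebuilt dataset

# Library-side fix pattern: a type with a non-trivial constructor
# becomes picklable once __reduce__ returns a callable plus the
# arguments that recreate equivalent state. (Illustrative only;
# this is not Arrow's actual implementation.)
class PicklablePiece:
    def __init__(self, path, row_group=None):
        self.path = path
        self.row_group = row_group

    def __reduce__(self):
        # Rebuild from constructor arguments rather than __dict__.
        return (PicklablePiece, (self.path, self.row_group))

piece = cloudpickle.loads(cloudpickle.dumps(PicklablePiece('part.0.parquet', 0)))
assert piece.path == 'part.0.parquet' and piece.row_group == 0
```

The round trip in the last two lines is what "serializable" means here; presumably the 0.14.0 fix gives the Arrow classes a reconstruction recipe of this general shape.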