[ https://issues.apache.org/jira/browse/ARROW-5144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16823055#comment-16823055 ]
Wes McKinney commented on ARROW-5144:
-------------------------------------

[~mrocklin] We might be able to do a 0.13.1 release including this and some other fixes in a week or two. If that would be helpful, let us know. FWIW, ParquetDatasetPiece is regarded as an implementation detail and was not necessarily intended to be serializable, so we will keep this in mind for the future.

> [Python] ParquetDataset and ParquetPiece not serializable
> ---------------------------------------------------------
>
>                 Key: ARROW-5144
>                 URL: https://issues.apache.org/jira/browse/ARROW-5144
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: parquet, Python
>    Affects Versions: 0.13.0
>         Environment: osx python36/conda cloudpickle 0.8.1
>                      arrow-cpp 0.13.0 py36ha71616b_0 conda-forge
>                      pyarrow 0.13.0 py36hb37e6aa_0 conda-forge
>            Reporter: Martin Durant
>            Assignee: Krisztian Szucs
>            Priority: Critical
>              Labels: pull-request-available
>             Fix For: 0.14.0
>
>         Attachments: part.0.parquet
>
>          Time Spent: 3h 50m
>  Remaining Estimate: 0h
>
> Since 0.13.0, parquet instances are no longer serialisable, which means that dask.distributed cannot pass them between processes in order to load parquet in parallel.
> Example:
> ```
> >>> import cloudpickle
> >>> import pyarrow.parquet as pq
> >>> pf = pq.ParquetDataset('nation.impala.parquet')
> >>> cloudpickle.dumps(pf)
> ~/anaconda/envs/py36/lib/python3.6/site-packages/cloudpickle/cloudpickle.py in dumps(obj, protocol)
>     893         try:
>     894             cp = CloudPickler(file, protocol=protocol)
> --> 895             cp.dump(obj)
>     896             return file.getvalue()
>     897         finally:
> ~/anaconda/envs/py36/lib/python3.6/site-packages/cloudpickle/cloudpickle.py in dump(self, obj)
>     266         self.inject_addons()
>     267         try:
> --> 268             return Pickler.dump(self, obj)
>     269         except RuntimeError as e:
>     270             if 'recursion' in e.args[0]:
> ~/anaconda/envs/py36/lib/python3.6/pickle.py in dump(self, obj)
>     407         if self.proto >= 4:
>     408             self.framer.start_framing()
> --> 409         self.save(obj)
>     410         self.write(STOP)
>     411         self.framer.end_framing()
> ~/anaconda/envs/py36/lib/python3.6/pickle.py in save(self, obj, save_persistent_id)
>     519
>     520         # Save the reduce() output and finally memoize the object
> --> 521         self.save_reduce(obj=obj, *rv)
>     522
>     523     def persistent_id(self, obj):
> ~/anaconda/envs/py36/lib/python3.6/pickle.py in save_reduce(self, func, args, state, listitems, dictitems, obj)
>     632
>     633         if state is not None:
> --> 634             save(state)
>     635             write(BUILD)
>     636
> ~/anaconda/envs/py36/lib/python3.6/pickle.py in save(self, obj, save_persistent_id)
>     474         f = self.dispatch.get(t)
>     475         if f is not None:
> --> 476             f(self, obj)  # Call unbound method with explicit self
>     477             return
>     478
> ~/anaconda/envs/py36/lib/python3.6/pickle.py in save_dict(self, obj)
>     819
>     820         self.memoize(obj)
> --> 821         self._batch_setitems(obj.items())
>     822
>     823     dispatch[dict] = save_dict
> ~/anaconda/envs/py36/lib/python3.6/pickle.py in _batch_setitems(self, items)
>     845             for k, v in tmp:
>     846                 save(k)
> --> 847                 save(v)
>     848             write(SETITEMS)
>     849         elif n:
> ~/anaconda/envs/py36/lib/python3.6/pickle.py in save(self, obj, save_persistent_id)
>     494         reduce = getattr(obj, "__reduce_ex__", None)
>     495         if reduce is not None:
> --> 496             rv = reduce(self.proto)
>     497         else:
>     498             reduce = getattr(obj, "__reduce__", None)
> ~/anaconda/envs/py36/lib/python3.6/site-packages/pyarrow/_parquet.cpython-36m-darwin.so in pyarrow._parquet.ParquetSchema.__reduce_cython__()
> TypeError: no default __reduce__ due to non-trivial __cinit__
> ```
> The indicated schema instance is also referenced by the ParquetDatasetPiece instances.
> ref: https://github.com/dask/distributed/issues/2597

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
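For context on the traceback above: pickle recurses into the dataset's schema attribute, and `ParquetSchema` is a Cython extension type with a non-trivial `__cinit__` and no `__reduce__`, which is exactly what the final `TypeError` says. Below is a minimal sketch of both sides of the issue, assuming pyarrow 0.13 and a local `nation.impala.parquet` file; `rebuild_dataset` and `PicklablePiece` are hypothetical names for illustration, not Arrow APIs.

```python
import cloudpickle
import pyarrow.parquet as pq

# Caller-side workaround: ship the arguments needed to rebuild the
# dataset instead of the (unpicklable) dataset object itself.
def rebuild_dataset(path):
    # Hypothetical helper: reconstruct the ParquetDataset on the
    # receiving process rather than unpickling one.
    return pq.ParquetDataset(path)

payload = cloudpickle.dumps((rebuild_dataset, ('nation.impala.parquet',)))

func, args = cloudpickle.loads(payload)   # e.g. on a dask worker
table = func(*args).read()                # read the rebuilt dataset

# Library-side fix pattern: a type with a non-trivial constructor
# becomes picklable once __reduce__ returns a callable plus the
# arguments that recreate equivalent state. (Illustrative only;
# this is not Arrow's actual implementation.)
class PicklablePiece:
    def __init__(self, path, row_group=None):
        self.path = path
        self.row_group = row_group

    def __reduce__(self):
        # Rebuild from constructor arguments rather than __dict__.
        return (PicklablePiece, (self.path, self.row_group))

piece = cloudpickle.loads(cloudpickle.dumps(PicklablePiece('part.0.parquet', 0)))
assert piece.path == 'part.0.parquet' and piece.row_group == 0
```

The round trip in the last two lines is what "serializable" means here; presumably the 0.14.0 fix gives the Arrow classes a reconstruction recipe of this general shape.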