[
https://issues.apache.org/jira/browse/PARQUET-1405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16829371#comment-16829371
]
John Adcock commented on PARQUET-1405:
--------------------------------------
I'm being hit by this issue as well. I'm happy to try to help resolve it, but I
would appreciate any tips on getting started with diagnosing where the problem is.
Is the column statistics comment above accurate?
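For anyone else trying to reproduce this, here's a minimal sketch along the lines
of the attached parquet-issue-example.py. The 10^7-character string, the pandas
round trip via the default pyarrow engine, and the test.parquet / df_read_in names
are taken from the description and traceback; the column name and how the random
string is built are my own guesses:
{code:python}
import random
import string

import pandas as pd

# One record, one column, holding a single random string of 10**7 characters,
# as described in the issue report. ('wkt' is a placeholder column name.)
big_string = ''.join(random.choices(string.ascii_lowercase, k=10**7))
df = pd.DataFrame({'wkt': [big_string]})

# Writing with the default pyarrow engine succeeds...
df.to_parquet('test.parquet')

# ...but reading it back fails with:
# ArrowIOError: Couldn't deserialize thrift: No more data to read.
# Deserializing page header failed.
df_read_in = pd.read_parquet('test.parquet')
{code}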
> [C++] 'Couldn't deserialize thrift' error when reading large binary column
> --------------------------------------------------------------------------
>
> Key: PARQUET-1405
> URL: https://issues.apache.org/jira/browse/PARQUET-1405
> Project: Parquet
> Issue Type: Bug
> Components: parquet-cpp
> Environment: Ubuntu 16.04; Python 3.6; Pandas 0.23.4; Numpy 1.14.3
> Reporter: Jeremy Heffner
> Priority: Major
> Labels: parquet
> Fix For: cpp-1.6.0
>
> Attachments: parquet-issue-example.py
>
>
> We've run into issues reading Parquet files that contain long binary columns
> (utf8 strings). In particular, we were generating WKT representations of
> polygons that contained ~34 million characters when we ran into the issue.
> The attached example generates a dataframe with one record and one column
> containing a random string with 10^7 characters.
> Pandas (using the default pyarrow engine) successfully writes the file, but
> fails upon reading the file:
> {code:python}
> ---------------------------------------------------------------------------
> ArrowIOError Traceback (most recent call last)
> <ipython-input-25-25d21204cbad> in <module>()
> ----> 1 df_read_in = pd.read_parquet('test.parquet')
> ~/anaconda3/envs/uda/lib/python3.6/site-packages/pandas/io/parquet.py in read_parquet(path, engine, columns, **kwargs)
> 286
> 287 impl = get_engine(engine)
> --> 288 return impl.read(path, columns=columns, **kwargs)
> ~/anaconda3/envs/uda/lib/python3.6/site-packages/pandas/io/parquet.py in read(self, path, columns, **kwargs)
> 129 kwargs['use_pandas_metadata'] = True
> 130 result = self.api.parquet.read_table(path, columns=columns,
> --> 131 **kwargs).to_pandas()
> 132 if should_close:
> 133 try:
> ~/anaconda3/envs/uda/lib/python3.6/site-packages/pyarrow/parquet.py in read_table(source, columns, nthreads, metadata, use_pandas_metadata)
> 1044 fs = _get_fs_from_path(source)
> 1045 return fs.read_parquet(source, columns=columns, metadata=metadata,
> -> 1046 use_pandas_metadata=use_pandas_metadata)
> 1047
> 1048 pf = ParquetFile(source, metadata=metadata)
> ~/anaconda3/envs/uda/lib/python3.6/site-packages/pyarrow/filesystem.py in read_parquet(self, path, columns, metadata, schema, nthreads, use_pandas_metadata)
> 175 filesystem=self)
> 176 return dataset.read(columns=columns, nthreads=nthreads,
> --> 177 use_pandas_metadata=use_pandas_metadata)
> 178
> 179 def open(self, path, mode='rb'):
> ~/anaconda3/envs/uda/lib/python3.6/site-packages/pyarrow/parquet.py in read(self, columns, nthreads, use_pandas_metadata)
> 896 partitions=self.partitions,
> 897 open_file_func=open_file,
> --> 898 use_pandas_metadata=use_pandas_metadata)
> 899 tables.append(table)
> 900
> ~/anaconda3/envs/uda/lib/python3.6/site-packages/pyarrow/parquet.py in read(self, columns, nthreads, partitions, open_file_func, file, use_pandas_metadata)
> 459 table = reader.read_row_group(self.row_group, **options)
> 460 else:
> --> 461 table = reader.read(**options)
> 462
> 463 if len(self.partition_keys) > 0:
> ~/anaconda3/envs/uda/lib/python3.6/site-packages/pyarrow/parquet.py in read(self, columns, nthreads, use_pandas_metadata)
> 150 columns, use_pandas_metadata=use_pandas_metadata)
> 151 return self.reader.read_all(column_indices=column_indices,
> --> 152 nthreads=nthreads)
> 153
> 154 def scan_contents(self, columns=None, batch_size=65536):
> ~/anaconda3/envs/uda/lib/python3.6/site-packages/pyarrow/_parquet.pyx in pyarrow._parquet.ParquetReader.read_all()
> ~/anaconda3/envs/uda/lib/python3.6/site-packages/pyarrow/error.pxi in pyarrow.lib.check_status()
> ArrowIOError: Couldn't deserialize thrift: No more data to read.
> Deserializing page header failed.
> {code}
>
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)