[ https://issues.apache.org/jira/browse/ARROW-8694?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17100797#comment-17100797 ]

Eric Kisslinger commented on ARROW-8694:
----------------------------------------

Thanks for the clarification on what qualifies as "wide". That is where my 
confusion came from. And thanks for increasing the relevant buffer size so that 
I, and others, can continue to use upcoming versions of pyarrow on existing 
datasets.

> [Python][Parquet] parquet.read_schema() fails when loading wide table created 
> from Pandas DataFrame
> ---------------------------------------------------------------------------------------------------
>
>                 Key: ARROW-8694
>                 URL: https://issues.apache.org/jira/browse/ARROW-8694
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: C++, Python
>    Affects Versions: 0.17.0
>         Environment: Linux OS with RHEL 7.7 distribution
>            Reporter: Eric Kisslinger
>            Assignee: Wes McKinney
>            Priority: Critical
>              Labels: pull-request-available
>             Fix For: 1.0.0, 0.17.1
>
>          Time Spent: 1h
>  Remaining Estimate: 0h
>
> parquet.read_schema() fails when loading wide table schema created from 
> Pandas DataFrame with 50,000 columns. This works ok using pyarrow 0.16.0.
> {code:python}
> import numpy as np
> import pandas as pd
> import pyarrow as pa
> import pyarrow.parquet as pq
> print(pa.__version__)
> df = pd.DataFrame(({'c' + str(i): np.random.randn(10) for i in range(50000)}))
> table = pa.Table.from_pandas(df)
> pq.write_table(table, "test_wide.parquet")
> schema = pq.read_schema('test_wide.parquet')
> {code}
> Output:
> 0.17.0
> Traceback (most recent call last):
>  File 
> "/GAAL/kisseri/conda_envs/blkmamba-dev/lib/python3.6/site-packages/IPython/core/interactiveshell.py",
>  line 3319, in run_code
>  exec(code_obj, self.user_global_ns, self.user_ns)
>  File "<ipython-input-29-d5ef2df77263>", line 9, in <module>
>  table = pq.read_schema('test_wide.parquet')
>  File 
> "/GAAL/kisseri/conda_envs/blkmamba-dev/lib/python3.6/site-packages/pyarrow/parquet.py",
>  line 1793, in read_schema
>  return ParquetFile(where, memory_map=memory_map).schema.to_arrow_schema()
>  File 
> "/GAAL/kisseri/conda_envs/blkmamba-dev/lib/python3.6/site-packages/pyarrow/parquet.py",
>  line 210, in __init__
>  read_dictionary=read_dictionary, metadata=metadata)
>  File "pyarrow/_parquet.pyx", line 1023, in 
> pyarrow._parquet.ParquetReader.open
>  File "pyarrow/error.pxi", line 100, in pyarrow.lib.check_status
> OSError: Couldn't deserialize thrift: TProtocolException: Exceeded size limit
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
