Boris Urman created ARROW-16546:
-----------------------------------
Summary: [Python] Pyarrow fails to load parquet file with long
column names
Key: ARROW-16546
URL: https://issues.apache.org/jira/browse/ARROW-16546
Project: Apache Arrow
Issue Type: Bug
Components: Python
Affects Versions: 8.0.0
Environment: Ubuntu 20.04, pandas 1.4.2
Reporter: Boris Urman
Attachments: Screenshot from 2022-05-12 16-59-10.png
When loading a parquet file, "OSError: Couldn't deserialize thrift:
TProtocolException: Exceeded size limit" is raised. This seems to be related to
the memory usage of the table header (file metadata), so the issue may originate
in the native C++ part. Note that pyarrow 0.16 is able to read the same parquet
file.
Below is a code snippet that reproduces the issue. A screenshot of a
jupyter-notebook session with more details is attached.
The code snippet creates two pandas dataframes that differ only in their column
names. The one with short column names is stored and read without problems,
while the dataframe with long column names is stored successfully but raises an
exception during reading.
{code:python}
import pandas as pd
import numpy as np

data = np.random.randn(10, 250000)
index = range(10)
short_column_names = [f"col_{i}" for i in range(250000)]
long_column_names = [f"some_really_long_column_name_ending_with_integer_number_{i}" for i in range(250000)]

# Identical dataframes, only the column names are different
df_short_cols = pd.DataFrame(columns=short_column_names, data=data, index=index)
df_long_cols = pd.DataFrame(columns=long_column_names, data=data, index=index)

# Storing and reading the dataframe with short column names works OK
df_short_cols.to_parquet("short_cols.parquet", engine="pyarrow")
df_short_cols_loaded = pd.read_parquet("short_cols.parquet", engine="pyarrow")

# Storing the dataframe with long column names works OK but reading fails
df_long_cols.to_parquet("long_cols.parquet", engine="pyarrow")  # Storing works
df_long_cols_loaded = pd.read_parquet("long_cols.parquet", engine="pyarrow")  # <--- Fails here
{code}
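For reference, a possible workaround sketch: later pyarrow releases expose Thrift deserialization limit parameters ({{thrift_string_size_limit}} and {{thrift_container_size_limit}}) on {{pyarrow.parquet.read_table}}, which may not be available in the 8.0.0 version this report targets. The snippet below is scaled down so it runs quickly; the parameter names are an assumption based on newer pyarrow documentation.
{code:python}
import pandas as pd
import pyarrow.parquet as pq

# Scaled-down reproduction: 100 long column names instead of 250000
cols = [f"some_really_long_column_name_ending_with_integer_number_{i}"
        for i in range(100)]
df = pd.DataFrame([[0.0] * 100], columns=cols)
df.to_parquet("long_cols_small.parquet", engine="pyarrow")

# Assumption: these limit parameters exist in the installed pyarrow
# (added in releases after this report); they raise the Thrift size
# caps that trigger "Exceeded size limit" on very large file metadata.
table = pq.read_table(
    "long_cols_small.parquet",
    thrift_string_size_limit=1 << 30,
    thrift_container_size_limit=1 << 30,
)
print(table.num_columns)
{code}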