Boris Urman created ARROW-16546:
-----------------------------------
Summary: [Python] Pyarrow fails to load parquet file with long
column names
Key: ARROW-16546
URL: https://issues.apache.org/jira/browse/ARROW-16546
Project: Apache Arrow
Issue Type: Bug
Components: Python
Affects Versions: 8.0.0
Environment: Ubuntu 20.04, pandas 1.4.2
Reporter: Boris Urman
Attachments: Screenshot from 2022-05-12 16-59-10.png
When loading a parquet file, "OSError: Couldn't deserialize thrift:
TProtocolException: Exceeded size limit" is raised. This seems to be related to
the memory usage of the table header (file metadata), so the issue may originate
in the native C++ part. Note that pyarrow 0.16 is able to read the same parquet
file.
Below is a code snippet that reproduces the issue. A screenshot of a
jupyter-notebook session with more details is attached.
The code snippet creates two pandas dataframes that differ only in their column
names. The one with short column names is stored and read without problems,
while the dataframe with long column names is stored successfully but raises an
exception during reading.
{code:python}
import pandas as pd
import numpy as np

data = np.random.randn(10, 250000)
index = range(10)
short_column_names = [f"col_{i}" for i in range(250000)]
long_column_names = [f"some_really_long_column_name_ending_with_integer_number_{i}" for i in range(250000)]

# Identical dataframes, only the column names are different
df_short_cols = pd.DataFrame(columns=short_column_names, data=data, index=index)
df_long_cols = pd.DataFrame(columns=long_column_names, data=data, index=index)

# Storing and reading the dataframe with short column names works OK
df_short_cols.to_parquet("short_cols.parquet", engine="pyarrow")
df_short_cols_loaded = pd.read_parquet("short_cols.parquet", engine="pyarrow")

# Storing the dataframe with long column names works OK but reading fails
df_long_cols.to_parquet("long_cols.parquet", engine="pyarrow")  # Storing works
df_long_cols_loaded = pd.read_parquet("long_cols.parquet", engine="pyarrow")  # <--- Fails here
{code}
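For reference, a possible workaround sketch: later pyarrow releases expose Thrift deserialization limit parameters ({{thrift_string_size_limit}} and {{thrift_container_size_limit}}) on {{pyarrow.parquet.read_table}}, which may not be available in the 8.0.0 version this report targets. The snippet below is scaled down so it runs quickly; the parameter names are an assumption based on newer pyarrow documentation.
{code:python}
import pandas as pd
import pyarrow.parquet as pq

# Scaled-down reproduction: 100 long column names instead of 250000
cols = [f"some_really_long_column_name_ending_with_integer_number_{i}"
        for i in range(100)]
df = pd.DataFrame([[0.0] * 100], columns=cols)
df.to_parquet("long_cols_small.parquet", engine="pyarrow")

# Assumption: these limit parameters exist in the installed pyarrow
# (added in releases after this report); they raise the Thrift size
# caps that trigger "Exceeded size limit" on very large file metadata.
table = pq.read_table(
    "long_cols_small.parquet",
    thrift_string_size_limit=1 << 30,
    thrift_container_size_limit=1 << 30,
)
print(table.num_columns)
{code}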