Joshua Goller created ARROW-9686:
------------------------------------

             Summary: Parquet table schema missing columns when created from 
Pandas DataFrame with List data column
                 Key: ARROW-9686
                 URL: https://issues.apache.org/jira/browse/ARROW-9686
             Project: Apache Arrow
          Issue Type: Bug
          Components: Python
    Affects Versions: 1.0.0
            Reporter: Joshua Goller


In the example below, I create a Parquet table from a Pandas DataFrame 
containing a single column of lists. The table can be written and read back 
correctly, but when I try to examine the schema, the column is missing. Is this 
intentional behavior? 

 
{code:python}
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

# Version check
assert pa.__version__ == '1.0.0'

# Create a dataframe with one column where each row is a list 
bad_data_df = pd.DataFrame(
    {"data": [[j**i for i in range(10)] for j in range(10)]}
)

# Convert to pyarrow table and save as parquet 
path = "/tmp/pyarrow_bug_poc_bad_index" 
pa_table = pa.Table.from_pandas(bad_data_df) 
pq.write_table(pa_table, path)

# Now read it back 
ds = pq.ParquetDataset(path) 
table = ds.read() 
read_df = table.to_pandas()

# This assertion passes; the dataframe has the correct columns
assert 'data' in read_df.columns

# This assertion fails; the schema was apparently not updated! 
assert "data" in ds.schema.names
{code}
 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
