Joshua Goller created ARROW-9686:
------------------------------------
Summary: Parquet table schema missing columns when created from
Pandas DataFrame with List data column
Key: ARROW-9686
URL: https://issues.apache.org/jira/browse/ARROW-9686
Project: Apache Arrow
Issue Type: Bug
Components: Python
Affects Versions: 1.0.0
Reporter: Joshua Goller
In the example below, I create a Parquet table from a Pandas DataFrame
containing a single column of lists. The table can be written and read back
correctly, but when I try to examine the schema, the column is missing. Is this
intentional behavior?
{code:python}
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq
# Version check
assert pa.__version__ == '1.0.0'
# Create a dataframe with one column where each row is a list
bad_data_df = pd.DataFrame(
    {"data": [[j**i for i in range(10)] for j in range(10)]}
)
# Convert to pyarrow table and save as parquet
path = "/tmp/pyarrow_bug_poc_bad_index"
pa_table = pa.Table.from_pandas(bad_data_df)
pq.write_table(pa_table, path)
# Now read it back
ds = pq.ParquetDataset(path)
table = ds.read()
read_df = table.to_pandas()
# This assertion passes; the dataframe has the correct columns
assert 'data' in read_df.columns
# This assertion fails; the schema was apparently not updated!
assert "data" in ds.schema.names
{code}