Christian Thiel created ARROW-3766:
--------------------------------------
Summary: pa.Table.from_pandas doesn't use schema ordering
Key: ARROW-3766
URL: https://issues.apache.org/jira/browse/ARROW-3766
Project: Apache Arrow
Issue Type: Bug
Components: Python
Reporter: Christian Thiel
PyArrow is sensitive to the order of the columns when loading partitioned files.
With the function {{pa.Table.from_pandas(dataframe, schema=my_schema)}} we can
apply a schema to a DataFrame. I noticed that the returned {{pa.Table}} object
uses the column ordering of the pandas DataFrame rather than that of the schema.
Furthermore, a column can be present in the schema but missing from the
DataFrame (and hence from the resulting {{pa.Table}}).
This behaviour requires a lot of fiddling with the pandas DataFrame up front
if we want to write compatible partitioned files. Hence I argue that for
{{pa.Table.from_pandas}}, and any other comparable function, the schema should
be the principal source for the Table structure, not the columns and their
ordering in the pandas DataFrame. If I specify a schema, I simply expect
the resulting Table to actually have this schema.
Here is a small example. If you remove the reordering of df2, everything works
fine:
{code:python}
import pyarrow as pa
import pyarrow.parquet as pq
import pandas as pd
import os
import numpy as np
import shutil
PATH_PYARROW_MANUAL = '/tmp/pyarrow_manual.pa/'
if os.path.exists(PATH_PYARROW_MANUAL):
    shutil.rmtree(PATH_PYARROW_MANUAL)
os.mkdir(PATH_PYARROW_MANUAL)
# dtype=object is needed for ragged input on recent NumPy versions
arrays = np.array([np.array([0, 1, 2]), np.array([3, 4]), np.nan, np.nan], dtype=object)
strings = np.array([np.nan, np.nan, 'a', 'b'])
df = pd.DataFrame([0, 0, 1, 1], columns=['partition_column'])
df.index.name='DPRD_ID'
df['arrays'] = pd.Series(arrays)
df['strings'] = pd.Series(strings)
my_schema = pa.schema([('DPRD_ID', pa.int64()),
                       ('partition_column', pa.int32()),
                       ('arrays', pa.list_(pa.int32())),
                       ('strings', pa.string()),
                       ('new_column', pa.string())])
df1 = df[df.partition_column==0]
df2 = df[df.partition_column==1][['strings', 'partition_column', 'arrays']]
table1 = pa.Table.from_pandas(df1, schema=my_schema)
table2 = pa.Table.from_pandas(df2, schema=my_schema)
pq.write_table(table1, os.path.join(PATH_PYARROW_MANUAL, '1.pa'))
pq.write_table(table2, os.path.join(PATH_PYARROW_MANUAL, '2.pa'))
pd.read_parquet(PATH_PYARROW_MANUAL)
{code}
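For reference, the "fiddling" described above might look roughly like the sketch below. It is only an illustration of a possible workaround, not a proposed fix: {{align_to_schema}} is a hypothetical helper name, and the column list stands in for {{my_schema.names}} from the example. It relies only on pandas {{reset_index}} and {{reindex}}.

```python
import pandas as pd


def align_to_schema(df, column_names):
    """Hypothetical helper: return df with its columns ordered like column_names."""
    # Turn a named index into a regular column so it can be ordered as well.
    if df.index.name is not None and df.index.name in column_names:
        df = df.reset_index()
    # reindex puts the columns in the requested order, adds missing ones
    # filled with NaN, and drops columns that are not listed.
    return df.reindex(columns=list(column_names))


# Small stand-in for df2 from the example above (reordered columns).
df = pd.DataFrame({'strings': ['a', 'b'], 'partition_column': [1, 1]})
df.index.name = 'DPRD_ID'

# The list stands in for my_schema.names.
aligned = align_to_schema(
    df, ['DPRD_ID', 'partition_column', 'strings', 'new_column'])
print(list(aligned.columns))
```

The aligned frame could then presumably be passed to {{pa.Table.from_pandas(aligned, schema=my_schema, preserve_index=False)}}. Whether PyArrow accepts the all-missing {{new_column}} for the schema type is an assumption I have not verified.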