[
https://issues.apache.org/jira/browse/ARROW-3766?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Wes McKinney updated ARROW-3766:
--------------------------------
Summary: [Python] pa.Table.from_pandas doesn't use schema ordering (was:
pa.Table.from_pandas doesn't use schema ordering)
> [Python] pa.Table.from_pandas doesn't use schema ordering
> ---------------------------------------------------------
>
> Key: ARROW-3766
> URL: https://issues.apache.org/jira/browse/ARROW-3766
> Project: Apache Arrow
> Issue Type: Bug
> Components: Python
> Reporter: Christian Thiel
> Priority: Major
> Labels: parquet
> Fix For: 0.12.0
>
>
> PyArrow is sensitive to the order of the columns when loading partitioned files.
> With {{pa.Table.from_pandas(dataframe, schema=my_schema)}} we can apply a schema to a
> DataFrame. I noticed that the returned {{pa.Table}} uses the column ordering of the pandas
> DataFrame rather than that of the schema. Furthermore, it is possible to have columns in the
> schema that are missing from the DataFrame (and hence from the resulting pa.Table).
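> To make this concrete, here is a minimal, self-contained sketch of the reported behaviour
> (hypothetical column names 'a' and 'b', independent of the partitioned example below):
> {code:python}
> import pandas as pd
> import pyarrow as pa
>
> # The DataFrame's column order ('b', 'a') deliberately differs from the schema order ('a', 'b').
> df = pd.DataFrame({'b': ['x', 'y'], 'a': [1, 2]})
> schema = pa.schema([('a', pa.int64()), ('b', pa.string())])
>
> table = pa.Table.from_pandas(df, schema=schema, preserve_index=False)
> # As reported, the resulting Table follows the DataFrame order ('b', 'a')
> # instead of the schema order ('a', 'b').
> print(table.schema.names)
> print(schema.names)
> {code}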
> This behaviour requires a lot of fiddling with the pandas DataFrame up front if we want to
> write compatible partitioned files. I therefore argue that for {{pa.Table.from_pandas}}, and
> any comparable function, the schema should be the principal source of the Table structure,
> not the columns and ordering of the pandas DataFrame. If I specify a schema, I simply expect
> the resulting Table to actually have that schema.
> Here is a little example. If you remove the reordering of df2, everything works fine:
> {code:python}
> import pyarrow as pa
> import pyarrow.parquet as pq
> import pandas as pd
> import os
> import numpy as np
> import shutil
>
> # Start from a clean output directory.
> PATH_PYARROW_MANUAL = '/tmp/pyarrow_manual.pa/'
> if os.path.exists(PATH_PYARROW_MANUAL):
>     shutil.rmtree(PATH_PYARROW_MANUAL)
> os.mkdir(PATH_PYARROW_MANUAL)
>
> # Build a DataFrame with a list column and a string column,
> # each only partially populated.
> arrays = np.array([np.array([0, 1, 2]), np.array([3, 4]), np.nan, np.nan])
> strings = np.array([np.nan, np.nan, 'a', 'b'])
> df = pd.DataFrame([0, 0, 1, 1], columns=['partition_column'])
> df.index.name = 'DPRD_ID'
> df['arrays'] = pd.Series(arrays)
> df['strings'] = pd.Series(strings)
>
> # The schema also contains 'new_column', which is not present in the DataFrame.
> my_schema = pa.schema([('DPRD_ID', pa.int64()),
>                        ('partition_column', pa.int32()),
>                        ('arrays', pa.list_(pa.int32())),
>                        ('strings', pa.string()),
>                        ('new_column', pa.string())])
>
> df1 = df[df.partition_column == 0]
> # df2 has its columns reordered relative to the schema.
> df2 = df[df.partition_column == 1][['strings', 'partition_column', 'arrays']]
>
> table1 = pa.Table.from_pandas(df1, schema=my_schema)
> table2 = pa.Table.from_pandas(df2, schema=my_schema)
>
> pq.write_table(table1, os.path.join(PATH_PYARROW_MANUAL, '1.pa'))
> pq.write_table(table2, os.path.join(PATH_PYARROW_MANUAL, '2.pa'))
>
> pd.read_parquet(PATH_PYARROW_MANUAL)
> {code}
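> To illustrate the fiddling mentioned above, here is a sketch of the kind of manual alignment
> this currently requires (hypothetical helper {{align_to_schema}}, reusing {{df2}} and
> {{my_schema}} from the example above; only a sketch, not a guaranteed fix):
> {code:python}
> # Bring the DataFrame in line with the schema before calling from_pandas.
> def align_to_schema(frame, schema):
>     # Schema columns, excluding the index column which from_pandas adds itself.
>     columns = [field.name for field in schema if field.name != frame.index.name]
>     for name in columns:
>         if name not in frame.columns:
>             frame[name] = None   # column present in the schema but missing from the frame
>     return frame[columns]        # enforce the schema's column order
>
> table2 = pa.Table.from_pandas(align_to_schema(df2.copy(), my_schema), schema=my_schema)
> {code}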
> If
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)