[ https://issues.apache.org/jira/browse/ARROW-3766?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Krisztian Szucs reassigned ARROW-3766:
--------------------------------------

    Assignee: Krisztian Szucs

> [Python] pa.Table.from_pandas doesn't use schema ordering
> ---------------------------------------------------------
>
>                 Key: ARROW-3766
>                 URL: https://issues.apache.org/jira/browse/ARROW-3766
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>            Reporter: Christian Thiel
>            Assignee: Krisztian Szucs
>            Priority: Major
>              Labels: parquet, pull-request-available
>             Fix For: 0.12.0
>
>          Time Spent: 1h 40m
>  Remaining Estimate: 0h
>
> PyArrow is sensitive to the order of the columns when loading partitioned 
> files.
> With {{pa.Table.from_pandas(dataframe, schema=my_schema)}} we can apply a 
> schema to a DataFrame. I noticed that the returned {{pa.Table}} object uses 
> the column ordering of the pandas DataFrame rather than that of the schema. 
> Furthermore, a column can be present in the schema but missing from the 
> DataFrame, and it is then missing from the resulting {{pa.Table}} as well.
> This behaviour requires a lot of fiddling with the pandas DataFrame up front 
> if we want to write compatible partitioned files. Hence I argue that for 
> {{pa.Table.from_pandas}}, and any other comparable function, the schema 
> should be the principal source of the Table structure, not the columns and 
> their ordering in the pandas DataFrame. If I specify a schema, I simply 
> expect the resulting Table to actually have that schema.
> Here is a small example. If you remove the reordering of {{df2}}, everything 
> works fine:
> {code:python}
> import os
> import shutil
> 
> import numpy as np
> import pandas as pd
> import pyarrow as pa
> import pyarrow.parquet as pq
> 
> # Start with a fresh output directory for the partitioned files.
> PATH_PYARROW_MANUAL = '/tmp/pyarrow_manual.pa/'
> if os.path.exists(PATH_PYARROW_MANUAL):
>     shutil.rmtree(PATH_PYARROW_MANUAL)
> os.mkdir(PATH_PYARROW_MANUAL)
> 
> # dtype=object keeps the ragged integer arrays and the NaNs as real missing values.
> arrays = np.array([np.array([0, 1, 2]), np.array([3, 4]), np.nan, np.nan],
>                   dtype=object)
> strings = np.array([np.nan, np.nan, 'a', 'b'], dtype=object)
> 
> df = pd.DataFrame([0, 0, 1, 1], columns=['partition_column'])
> df.index.name = 'DPRD_ID'
> df['arrays'] = pd.Series(arrays)
> df['strings'] = pd.Series(strings)
> 
> my_schema = pa.schema([('DPRD_ID', pa.int64()),
>                        ('partition_column', pa.int32()),
>                        ('arrays', pa.list_(pa.int32())),
>                        ('strings', pa.string()),
>                        ('new_column', pa.string())])
> 
> # df2 has the same columns as df1, but in a different order.
> df1 = df[df.partition_column == 0]
> df2 = df[df.partition_column == 1][['strings', 'partition_column', 'arrays']]
> 
> table1 = pa.Table.from_pandas(df1, schema=my_schema)
> table2 = pa.Table.from_pandas(df2, schema=my_schema)
> pq.write_table(table1, os.path.join(PATH_PYARROW_MANUAL, '1.pa'))
> pq.write_table(table2, os.path.join(PATH_PYARROW_MANUAL, '2.pa'))
> 
> # Reading the partitioned directory back is where the column-order
> # mismatch between the two files shows up.
> pd.read_parquet(PATH_PYARROW_MANUAL)
> {code}
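> For reference, the fiddling I mean could look roughly like the sketch below 
> (just an illustration, untested beyond this example): the frame's columns are 
> put into schema order before calling {{from_pandas}}, so both partitions are 
> written with the same layout. It uses {{df2}} and {{my_schema}} from the 
> example above; a column such as {{new_column}} that exists only in the schema 
> would still have to be added to the frame by hand.
> {code:python}
> # Workaround sketch: order the DataFrame columns by the schema before
> # converting, so every partition file gets the same column layout.
> # ('DPRD_ID' is the index and is handled by from_pandas itself.)
> ordered = [name for name in my_schema.names if name in df2.columns]
> table2 = pa.Table.from_pandas(df2[ordered], schema=my_schema)
> {code}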
> If 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
