[jira] [Updated] (ARROW-3766) pa.Table.from_pandas doesn't use schema ordering

Christian Thiel (JIRA) Mon, 12 Nov 2018 00:29:14 -0800


     [ https://issues.apache.org/jira/browse/ARROW-3766?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Christian Thiel updated ARROW-3766:
-----------------------------------
    Description: 
PyArrow is sensitive to column order when loading partitioned files.
With {{pa.Table.from_pandas(dataframe, schema=my_schema)}} we can apply a 
schema to a DataFrame. I noticed that the returned {{pa.Table}} object uses 
the column order of the pandas DataFrame rather than that of the schema. 
Furthermore, a column can be present in the schema but absent from the 
DataFrame (and hence from the resulting {{pa.Table}}).

This behaviour requires a lot of fiddling with the pandas DataFrame beforehand 
if we want to write compatible partitioned files. Hence I argue that for 
{{pa.Table.from_pandas}}, and any other comparable function, the schema should 
be the principal source for the Table structure, not the columns and their 
ordering in the pandas DataFrame. If I specify a schema, I simply expect the 
resulting Table to actually have this schema.

Here is a small example. If you remove the reordering of df2, everything works 
fine:
{code:python}
import pyarrow as pa
import pyarrow.parquet as pq
import pandas as pd
import os
import numpy as np
import shutil

PATH_PYARROW_MANUAL = '/tmp/pyarrow_manual.pa/'

if os.path.exists(PATH_PYARROW_MANUAL):
    shutil.rmtree(PATH_PYARROW_MANUAL)
os.mkdir(PATH_PYARROW_MANUAL)

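# dtype=object keeps the ragged arrays and the NaN placeholders unchanged.
# Without it, recent NumPy versions reject the ragged input, and the NaN
# entries in the second array are converted to the string 'nan'.
arrays = np.array([np.array([0, 1, 2]), np.array([3, 4]), np.nan, np.nan],
                  dtype=object)
strings = np.array([np.nan, np.nan, 'a', 'b'], dtype=object)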

df = pd.DataFrame([0, 0, 1, 1], columns=['partition_column'])
df.index.name='DPRD_ID'
df['arrays'] = pd.Series(arrays)
df['strings'] = pd.Series(strings)

my_schema = pa.schema([('DPRD_ID', pa.int64()),
                       ('partition_column', pa.int32()),
                       ('arrays', pa.list_(pa.int32())),
                       ('strings', pa.string()),
                       ('new_column', pa.string())])

df1 = df[df.partition_column==0]
df2 = df[df.partition_column==1][['strings', 'partition_column', 'arrays']]


table1 = pa.Table.from_pandas(df1, schema=my_schema)
table2 = pa.Table.from_pandas(df2, schema=my_schema)

pq.write_table(table1, os.path.join(PATH_PYARROW_MANUAL, '1.pa'))
pq.write_table(table2, os.path.join(PATH_PYARROW_MANUAL, '2.pa'))

pd.read_parquet(PATH_PYARROW_MANUAL)
{code}
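
Until the schema drives the table structure, one possible workaround is to align the DataFrame with the schema in pandas first. The following is only a sketch under stated assumptions: {{align_to_columns}} is a hypothetical helper, not part of pyarrow or pandas, and the column names are passed as a plain list (in the example above they would presumably come from {{my_schema.names}}). The final {{from_pandas}} call is left as a comment, since its behaviour is exactly what this issue is about.

```python
import pandas as pd


def align_to_columns(df, column_names, index_name=None):
    """Return a copy of df whose columns follow the order of column_names.

    Hypothetical helper for illustration only. Columns missing from df are
    added as all-None object columns, columns not listed are dropped, and
    the name given as index_name is skipped because it lives in the index.
    """
    wanted = [name for name in column_names if name != index_name]
    missing = [name for name in wanted if name not in df.columns]
    # assign() returns a new frame; selecting with [wanted] then fixes the order.
    return df.assign(**{name: None for name in missing})[wanted]


# A small frame shaped like df2 above, with the columns deliberately reordered.
df2 = pd.DataFrame(
    {"strings": ["a", "b"], "partition_column": [1, 1], "arrays": [None, None]},
    index=pd.Index([2, 3], name="DPRD_ID"),
)
schema_names = ["DPRD_ID", "partition_column", "arrays", "strings", "new_column"]

aligned = align_to_columns(df2, schema_names, index_name="DPRD_ID")
print(list(aligned.columns))

# Assumed follow-up, not executed here:
# table2 = pa.Table.from_pandas(aligned, schema=my_schema)
```

Filling the missing column with None rather than NaN keeps it as an object column, which should leave the choice of Arrow type to the schema.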

If 



> pa.Table.from_pandas doesn't use schema ordering
> ------------------------------------------------
>
>                 Key: ARROW-3766
>                 URL: https://issues.apache.org/jira/browse/ARROW-3766
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>            Reporter: Christian Thiel
>            Priority: Major
>              Labels: parquet



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
