[jira] [Resolved] (ARROW-10532) [Python] Mangled pandas_metadata when specified schema has different order as DataFrame columns

Antoine Pitrou (Jira) Mon, 16 Nov 2020 12:28:22 -0800


     [ 
https://issues.apache.org/jira/browse/ARROW-10532?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Antoine Pitrou resolved ARROW-10532.
------------------------------------
    Resolution: Fixed

Issue resolved by pull request 8624
[https://github.com/apache/arrow/pull/8624]

> [Python] Mangled pandas_metadata when specified schema has different order as 
> DataFrame columns
> -----------------------------------------------------------------------------------------------
>
>                 Key: ARROW-10532
>                 URL: https://issues.apache.org/jira/browse/ARROW-10532
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>    Affects Versions: 2.0.0
>         Environment: Ubuntu 20.04 with Python 3.8.6 from miniconda / 
> conda-forge
>            Reporter: Zane Selvans
>            Assignee: Joris Van den Bossche
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 2.0.1, 3.0.0
>
>          Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> When calling pyarrow.Table.from_pandas() with an explicit schema, the 
> columns of the DataFrame and the fields of the schema have to be in the 
> same order, because the pandas_metadata entries are matched to columns by 
> position rather than by column name. If the two orderings differ, metadata 
> ends up attached to the wrong fields, which causes errors later on. In the 
> example below, data_col is given the numpy_type and timezone metadata of 
> datetime_utc, and datetime_utc is given those of data_col.
>  
> {code:python}
> import numpy as np
> import pandas as pd
> import pyarrow as pa
>
> # Two rows of sample data: random floats and hourly UTC timestamps.
> data_col = np.random.random_sample(2)
> datetime_col = pd.date_range("2020-01-01T00:00:00Z", freq="H", periods=2)
>
> data_field = pa.field("data_col", pa.float32(), nullable=True)
> datetime_field = pa.field(
>     "datetime_utc", pa.timestamp("s", tz="UTC"), nullable=False
> )
>
> # The DataFrame column order is (datetime_utc, data_col).
> df = pd.DataFrame({"datetime_utc": datetime_col, "data_col": data_col})
> good_schema = pa.schema([datetime_field, data_field])  # same order as df
> bad_schema = pa.schema([data_field, datetime_field])   # reversed order
>
> # Schema order matches the DataFrame: the metadata is correct.
> pa.Table.from_pandas(
>     df, preserve_index=False, schema=good_schema
> ).schema.pandas_metadata
> #{'index_columns': [],
> # 'column_indexes': [],
> # 'columns': [{'name': 'datetime_utc',
> #   'field_name': 'datetime_utc',
> #   'pandas_type': 'datetimetz',
> #   'numpy_type': 'datetime64[ns]',
> #   'metadata': {'timezone': 'UTC'}},
> #  {'name': 'data_col',
> #   'field_name': 'data_col',
> #   'pandas_type': 'float32',
> #   'numpy_type': 'float64',
> #   'metadata': None}],
> # 'creator': {'library': 'pyarrow', 'version': '2.0.0'},
> # 'pandas_version': '1.1.4'}
>
> # Schema order differs from the DataFrame: numpy_type and metadata are swapped.
> pa.Table.from_pandas(
>     df, preserve_index=False, schema=bad_schema
> ).schema.pandas_metadata
> #{'index_columns': [],
> # 'column_indexes': [],
> # 'columns': [{'name': 'data_col',
> #   'field_name': 'data_col',
> #   'pandas_type': 'float32',
> #   'numpy_type': 'datetime64[ns]',
> #   'metadata': {'timezone': 'UTC'}},
> #  {'name': 'datetime_utc',
> #   'field_name': 'datetime_utc',
> #   'pandas_type': 'datetimetz',
> #   'numpy_type': 'float64',
> #   'metadata': None}],
> # 'creator': {'library': 'pyarrow', 'version': '2.0.0'},
> # 'pandas_version': '1.1.4'}
> {code}
>  
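
The mismatch described above can be reproduced without pyarrow. The following is only a minimal sketch of the mechanism, not the pyarrow implementation and not the change made in pull request 8624: the dicts and the helper names pair_by_position and pair_by_name are hypothetical stand-ins chosen for illustration. It pairs schema fields with DataFrame columns once by position and once by name.

```python
# Hypothetical stand-ins for the DataFrame columns (in DataFrame order)
# and the schema fields (in the "bad" reversed order) from the report.
df_columns = [
    {"name": "datetime_utc", "numpy_type": "datetime64[ns]",
     "metadata": {"timezone": "UTC"}},
    {"name": "data_col", "numpy_type": "float64", "metadata": None},
]
schema_fields = [
    {"name": "data_col", "pandas_type": "float32"},
    {"name": "datetime_utc", "pandas_type": "datetimetz"},
]


def merge(field, column):
    """Combine one schema field with one DataFrame column description."""
    return {
        "name": field["name"],
        "pandas_type": field["pandas_type"],
        "numpy_type": column["numpy_type"],
        "metadata": column["metadata"],
    }


def pair_by_position(fields, columns):
    """Pair the i-th field with the i-th column (sensitive to ordering)."""
    return [merge(f, c) for f, c in zip(fields, columns)]


def pair_by_name(fields, columns):
    """Pair each field with the column of the same name (order-independent)."""
    by_name = {c["name"]: c for c in columns}
    return [merge(f, by_name[f["name"]]) for f in fields]


mangled = pair_by_position(schema_fields, df_columns)
correct = pair_by_name(schema_fields, df_columns)

for entry in mangled:
    print("by position:", entry)
for entry in correct:
    print("by name:    ", entry)
```

Pairing by position gives data_col the datetime64[ns] numpy_type and the UTC timezone metadata, which is the same swap shown in the second output of the report. On an affected version, a possible workaround (an assumption, not something stated in the issue) would be to reorder the DataFrame columns to the schema order before the call, for example by passing df[schema.names] to from_pandas().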



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
