[
https://issues.apache.org/jira/browse/ARROW-10532?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Joris Van den Bossche updated ARROW-10532:
------------------------------------------
Summary: [Python] Mangled pandas_metadata when specified schema has
different order as DataFrame columns (was: [Python] Some pandas_metadata
fields are ordered by index not label)
> [Python] Mangled pandas_metadata when specified schema has different order as
> DataFrame columns
> -----------------------------------------------------------------------------------------------
>
> Key: ARROW-10532
> URL: https://issues.apache.org/jira/browse/ARROW-10532
> Project: Apache Arrow
> Issue Type: Bug
> Components: Python
> Affects Versions: 2.0.0
> Environment: Ubuntu 20.04 with Python 3.8.6 from miniconda /
> conda-forge
> Reporter: Zane Selvans
> Assignee: Joris Van den Bossche
> Priority: Major
>
> When calling pyarrow.Table.from_pandas() with an explicit schema, the columns
> of the DataFrame and the fields of the schema have to be in the same order,
> because some pandas_metadata fields are associated with columns by position
> rather than by column name. If the two orderings differ, metadata ends up
> attached to the wrong fields, which leads to all kinds of errors. In the
> example below, the numpy_type and metadata entries of the two columns are
> swapped when the schema order differs from the DataFrame column order.
>
> {code:python}
> import pyarrow as pa
> import pandas as pd
> import numpy as np
> data_col = np.random.random_sample(2)
> datetime_col = pd.date_range("2020-01-01T00:00:00Z", freq="H", periods=2)
> data_field = pa.field("data_col", pa.float32(), nullable=True)
> datetime_field = pa.field("datetime_utc", pa.timestamp("s", tz="UTC"),
>                           nullable=False)
> df = pd.DataFrame({"datetime_utc": datetime_col, "data_col": data_col})
> good_schema = pa.schema([datetime_field, data_field])
> bad_schema = pa.schema([data_field, datetime_field])
> # Schema field order matches the DataFrame column order: metadata is correct.
> pa.Table.from_pandas(df, preserve_index=False,
>                      schema=good_schema).schema.pandas_metadata
> #{'index_columns': [],
> # 'column_indexes': [],
> # 'columns': [{'name': 'datetime_utc',
> #   'field_name': 'datetime_utc',
> #   'pandas_type': 'datetimetz',
> #   'numpy_type': 'datetime64[ns]',
> #   'metadata': {'timezone': 'UTC'}},
> #  {'name': 'data_col',
> #   'field_name': 'data_col',
> #   'pandas_type': 'float32',
> #   'numpy_type': 'float64',
> #   'metadata': None}],
> # 'creator': {'library': 'pyarrow', 'version': '2.0.0'},
> # 'pandas_version': '1.1.4'}
> # Schema field order differs from the DataFrame column order: numpy_type and
> # metadata are attached to the wrong columns.
> pa.Table.from_pandas(df, preserve_index=False,
>                      schema=bad_schema).schema.pandas_metadata
> #{'index_columns': [],
> # 'column_indexes': [],
> # 'columns': [{'name': 'data_col',
> #   'field_name': 'data_col',
> #   'pandas_type': 'float32',
> #   'numpy_type': 'datetime64[ns]',
> #   'metadata': {'timezone': 'UTC'}},
> #  {'name': 'datetime_utc',
> #   'field_name': 'datetime_utc',
> #   'pandas_type': 'datetimetz',
> #   'numpy_type': 'float64',
> #   'metadata': None}],
> # 'creator': {'library': 'pyarrow', 'version': '2.0.0'},
> # 'pandas_version': '1.1.4'}
> {code}
>