Zane Selvans created ARROW-10532:
------------------------------------
Summary: Some pandas_metadata fields are ordered by index not label
Key: ARROW-10532
URL: https://issues.apache.org/jira/browse/ARROW-10532
Project: Apache Arrow
Issue Type: Bug
Components: Python
Affects Versions: 2.0.0
Environment: Ubuntu 20.04 with Python 3.8.6 from miniconda /
conda-forge
Reporter: Zane Selvans
When calling pyarrow.Table.from_pandas() with an explicit schema, the columns
of the dataframe and the fields of the schema have to be in the same order,
because some pandas_metadata fields are associated with columns by position
rather than by column name. If the two orderings differ, metadata ends up
attached to the wrong fields, which leads to all kinds of errors. In the
example below, name, field_name and pandas_type follow the schema order, while
numpy_type and metadata follow the dataframe column order, so data_col is
reported with numpy_type datetime64[ns] and a UTC timezone.
{code:python}
import pyarrow as pa
import pandas as pd
import numpy as np
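# Two columns of sample data: random floats and tz-aware hourly timestamps.
data_col = np.random.random_sample(2)
datetime_col = pd.date_range("2020-01-01T00:00:00Z", freq="H", periods=2)

# One schema field per dataframe column.
data_field = pa.field("data_col", pa.float32(), nullable=True)
datetime_field = pa.field("datetime_utc", pa.timestamp("s", tz="UTC"),
                          nullable=False)

# The dataframe column order is (datetime_utc, data_col).
df = pd.DataFrame({"datetime_utc": datetime_col, "data_col": data_col})
good_schema = pa.schema([datetime_field, data_field])  # same order as df
bad_schema = pa.schema([data_field, datetime_field])   # reversed order

# Matching order: every metadata entry describes the right column.
pa.Table.from_pandas(df, preserve_index=False,
                     schema=good_schema).schema.pandas_metadata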
#{'index_columns': [],
# 'column_indexes': [],
# 'columns': [{'name': 'datetime_utc',
# 'field_name': 'datetime_utc',
# 'pandas_type': 'datetimetz',
# 'numpy_type': 'datetime64[ns]',
# 'metadata': {'timezone': 'UTC'}},
# {'name': 'data_col',
# 'field_name': 'data_col',
# 'pandas_type': 'float32',
# 'numpy_type': 'float64',
# 'metadata': None}],
# 'creator': {'library': 'pyarrow', 'version': '2.0.0'},
# 'pandas_version': '1.1.4'}
# Reversed order: numpy_type and metadata are attached to the wrong columns.
pa.Table.from_pandas(df, preserve_index=False,
                     schema=bad_schema).schema.pandas_metadata
#{'index_columns': [],
# 'column_indexes': [],
# 'columns': [{'name': 'data_col',
# 'field_name': 'data_col',
# 'pandas_type': 'float32',
# 'numpy_type': 'datetime64[ns]',
# 'metadata': {'timezone': 'UTC'}},
# {'name': 'datetime_utc',
# 'field_name': 'datetime_utc',
# 'pandas_type': 'datetimetz',
# 'numpy_type': 'float64',
# 'metadata': None}],
# 'creator': {'library': 'pyarrow', 'version': '2.0.0'},
# 'pandas_version': '1.1.4'}
{code}