[ https://issues.apache.org/jira/browse/ARROW-10532?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Antoine Pitrou resolved ARROW-10532.
------------------------------------
    Resolution: Fixed

Issue resolved by pull request 8624
[https://github.com/apache/arrow/pull/8624]

> [Python] Mangled pandas_metadata when specified schema has different order as DataFrame columns
> -----------------------------------------------------------------------------------------------
>
>                 Key: ARROW-10532
>                 URL: https://issues.apache.org/jira/browse/ARROW-10532
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>    Affects Versions: 2.0.0
>         Environment: Ubuntu 20.04 with Python 3.8.6 from miniconda / conda-forge
>            Reporter: Zane Selvans
>            Assignee: Joris Van den Bossche
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 2.0.1, 3.0.0
>
>          Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> When calling pyarrow.Table.from_pandas() with an explicit schema, the
> ordering of the columns in the dataframe and the schema have to be identical,
> because the pandas_metadata fields are associated with columns on the basis
> of the ordering, rather than the name of their column. If the ordering of the
> dataframe columns and schema fields isn't identical, then you end up
> associating metadata with the wrong fields, which leads to all kinds of
> errors.
>
> {code:java}
> import pyarrow as pa
> import pandas as pd
> import numpy as np
> data_col = np.random.random_sample(2)
> datetime_col = pd.date_range("2020-01-01T00:00:00Z", freq="H", periods=2)
> data_field = pa.field("data_col", pa.float32(), nullable=True)
> datetime_field = pa.field("datetime_utc", pa.timestamp("s", tz="UTC"),
>                           nullable=False)
> df = pd.DataFrame({"datetime_utc": datetime_col, "data_col": data_col})
> good_schema = pa.schema([datetime_field, data_field])
> bad_schema = pa.schema([data_field, datetime_field])
> pa.Table.from_pandas(df, preserve_index=False,
>                      schema=good_schema).schema.pandas_metadata
> #{'index_columns': [],
> # 'column_indexes': [],
> # 'columns': [{'name': 'datetime_utc',
> #   'field_name': 'datetime_utc',
> #   'pandas_type': 'datetimetz',
> #   'numpy_type': 'datetime64[ns]',
> #   'metadata': {'timezone': 'UTC'}},
> #  {'name': 'data_col',
> #   'field_name': 'data_col',
> #   'pandas_type': 'float32',
> #   'numpy_type': 'float64',
> #   'metadata': None}],
> # 'creator': {'library': 'pyarrow', 'version': '2.0.0'},
> # 'pandas_version': '1.1.4'}
> pa.Table.from_pandas(df, preserve_index=False,
>                      schema=bad_schema).schema.pandas_metadata
> #{'index_columns': [],
> # 'column_indexes': [],
> # 'columns': [{'name': 'data_col',
> #   'field_name': 'data_col',
> #   'pandas_type': 'float32',
> #   'numpy_type': 'datetime64[ns]',
> #   'metadata': {'timezone': 'UTC'}},
> #  {'name': 'datetime_utc',
> #   'field_name': 'datetime_utc',
> #   'pandas_type': 'datetimetz',
> #   'numpy_type': 'float64',
> #   'metadata': None}],
> # 'creator': {'library': 'pyarrow', 'version': '2.0.0'},
> # 'pandas_version': '1.1.4'}
> {code}
>

--
This message was sent by Atlassian Jira
(v8.3.4#803005)