[
https://issues.apache.org/jira/browse/ARROW-17852?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Weston Pace updated ARROW-17852:
--------------------------------
Summary: [python] `dtype` of `Categorical` category columns are not
preserved (was: `dtype` of `Categorical` category columns are not preserved)
> [python] `dtype` of `Categorical` category columns are not preserved
> --------------------------------------------------------------------
>
> Key: ARROW-17852
> URL: https://issues.apache.org/jira/browse/ARROW-17852
> Project: Apache Arrow
> Issue Type: Bug
> Components: Python
> Affects Versions: 9.0.0
> Reporter: Ryan Ballard
> Priority: Major
> Labels: categorical, pandas, pyarrow
>
> Hi there,
> First time submitting an issue here so apologies if there's anything I've
> missed.
> I'm seeing the bug below, whereby the {{dtype}} of the categories themselves
> (within a {{pd.Categorical}}) is not preserved on a round trip via pyarrow.
> Hopefully the snippet below demonstrates the issue.
> This causes a problem because the dtypes need to be the same in order for
> the categories to be considered the same (so they can then be
> concatenated, for example).
> The current workaround is to store the column as a plain {{pd.StringDtype()}}
> and then convert it to {{pd.Categorical}} in memory with pandas (which infers
> the category dtype from the underlying type, but in doing so sacrifices the
> disk savings of storing the column as a dictionary).
> Using pyarrow 9.0.0 and pandas 1.4.4.
> Thanks
>
> {{import pandas as pd}}
> {{import pyarrow as pa}}
>
> {{# note, Categorical column B is constructed from `pd.StringDtype`}}
> {{df = pd.DataFrame({"A": ["a", "b", "c", "a"]}, dtype=pd.StringDtype())}}
> {{df["B"] = df["A"].astype("category")}}
> {{print(df["B"].cat.categories)}}
> {{# Index(['a', 'b', 'c'], dtype='string')}}
>
> {{# however, this is downcast to `object` during a roundtrip}}
> {{print(pa.Table.from_pandas(df).to_pandas()["B"].cat.categories)}}
> {{# Index(['a', 'b', 'c'], dtype='object')}}
>
>
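As an alternative to the workaround described in the report, the category dtype could in principle be restored after the round trip instead of giving up dictionary storage. The following is only a minimal sketch under stated assumptions, not a fix in Arrow itself: `restore_category_dtype` is a hypothetical helper (it is not part of pandas or pyarrow), and the round trip is simulated here by building a categorical with `object`-dtype categories rather than by calling pyarrow.

```python
import pandas as pd


def restore_category_dtype(series: pd.Series, categories_dtype) -> pd.Series:
    """Rebuild a categorical Series so its categories use `categories_dtype`.

    Hypothetical helper for illustration; not part of pandas or pyarrow.
    """
    old = series.dtype
    # Cast only the (small) categories index, not every value in the column.
    new_categories = old.categories.astype(categories_dtype)
    new_dtype = pd.CategoricalDtype(new_categories, ordered=old.ordered)
    # Reuse the existing integer codes; a code of -1 (missing) is preserved.
    values = pd.Categorical.from_codes(series.cat.codes.to_numpy(), dtype=new_dtype)
    return pd.Series(values, index=series.index, name=series.name)


# Assumption: stand-in for the column after the pyarrow round trip,
# i.e. a categorical whose categories have dtype `object`.
roundtripped = pd.Series(
    pd.Categorical(
        ["a", "b", "c", "a"],
        categories=pd.Index(["a", "b", "c"], dtype=object),
    ),
    name="B",
)

restored = restore_category_dtype(roundtripped, pd.StringDtype())
print(roundtripped.cat.categories.dtype)
print(restored.cat.categories.dtype)
```

In real use, the input would be the column returned by {{pa.Table.from_pandas(df).to_pandas()}}, and the target dtype would have to be known or recorded separately, since it is exactly the information lost in the round trip.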
--
This message was sent by Atlassian Jira
(v8.20.10#820010)