[
https://issues.apache.org/jira/browse/ARROW-3801?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16858496#comment-16858496
]
Thomas Buhrmann commented on ARROW-3801:
----------------------------------------
Hi, yes, I can confirm that pandas 0.24.2 deals correctly with the non-writeable
category index. However, I would still consider the fact that the Arrow
roundtrip leaves this index non-writeable, if not a bug, an unnecessary
limitation. Not everybody may be able to update their production code to
pandas 0.24.2, so maybe for compatibility reasons this could still be "fixed"?
If not, I'll leave the following workaround here for reference, which I use
whenever I load an Arrow-serialized DataFrame in Python:
{code:python}
import numpy as np


def fix_arrow_categoricals(df):
    """A roundtrip of categoricals through pd->arr->pd can make categories
    non-writeable, which may make other parts of pandas blow up later on.
    """
    cats = df.select_dtypes('category').columns
    for col in cats:
        # Copying resets the array's flags, so the copy is writeable again
        df[col].cat.categories = np.copy(df[col].cat.categories)
{code}
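For anyone wondering why the copy helps: the writeable flag belongs to the
NumPy array object, and a copy owns fresh memory, so it comes back writeable.
Below is a minimal sketch using only NumPy. The read-only array is simulated by
clearing the flag by hand (in the real case it comes out of the Arrow
conversion), and writeable_copy is a hypothetical helper name, not part of
pandas or pyarrow:

```python
import numpy as np


def writeable_copy(arr):
    """Return arr itself if it is writeable, otherwise a writeable copy.

    Hypothetical helper for illustration; not part of pandas or pyarrow.
    """
    arr = np.asarray(arr)
    return arr if arr.flags.writeable else np.copy(arr)


# Assumption: simulate the read-only categories array by clearing the flag.
categories = np.array([1, 2, 3])
categories.flags.writeable = False

try:
    categories[0] = 10  # writing to a read-only array raises ValueError
    write_failed = False
except ValueError:
    write_failed = True

fixed = writeable_copy(categories)
fixed[0] = 10  # the copy owns its data, so this write succeeds
```

The original array is left untouched and stays read-only; only the copy is
modified, which is the same effect the workaround above relies on.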
> [Python] Pandas-Arrow roundtrip makes pd categorical index not writeable
> ------------------------------------------------------------------------
>
> Key: ARROW-3801
> URL: https://issues.apache.org/jira/browse/ARROW-3801
> Project: Apache Arrow
> Issue Type: Bug
> Components: C++, Python
> Affects Versions: 0.10.0
> Reporter: Thomas Buhrmann
> Priority: Major
> Fix For: 0.14.0
>
>
> Serializing and deserializing a pandas series with categorical dtype will
> make the categorical index non-writeable, which in turn trips up pandas when
> e.g. reordering the categories, raising "ValueError: buffer source array is
> read-only" :
> {code}
> import pandas as pd
> import pyarrow as pa
> df = pd.Series([1,2,3], dtype='category', name="c1").to_frame()
> print("DType before:", repr(df.c1.dtype))
> print("Writeable:", df.c1.cat.categories.values.flags.writeable)
> ro = df.c1.cat.reorder_categories([3,2,1])
> print("DType reordered:", repr(ro.dtype), "\n")
> tbl = pa.Table.from_pandas(df)
> df2 = tbl.to_pandas()
> print("DType after:", repr(df2.c1.dtype))
> print("Writeable:", df2.c1.cat.categories.values.flags.writeable)
> ro = df2.c1.cat.reorder_categories([3,2,1])
> print("DType reordered:", repr(ro.dtype), "\n")
> {code}
>
> Outputs:
>
> {code:java}
> DType before: CategoricalDtype(categories=[1, 2, 3], ordered=False)
> Writeable: True
> DType reordered: CategoricalDtype(categories=[3, 2, 1], ordered=False)
> DType after: CategoricalDtype(categories=[1, 2, 3], ordered=False)
> Writeable: False
> ---------------------------------------------------------------------------
> ValueError Traceback (most recent call last)
> <ipython-input-365-85b439586c1a> in <module>
> 12 print("DType after:", repr(df2.c1.dtype))
> 13 print("Writeable:", df2.c1.cat.categories.values.flags.writeable)
> ---> 14 ro = df2.c1.cat.reorder_categories([3,2,1])
> 15 print("DType reordered:", repr(ro.dtype), "\n")
> {code}
>
>
>