[jira] [Commented] (ARROW-3801) [Python] Pandas-Arrow roundtrip makes pd categorical index not writeable

Thomas Buhrmann (JIRA) Fri, 07 Jun 2019 03:52:34 -0700


    [ https://issues.apache.org/jira/browse/ARROW-3801?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16858496#comment-16858496 ]


Thomas Buhrmann commented on ARROW-3801:
----------------------------------------

Hi, yes, I can confirm that pandas 0.24.2 deals correctly with the non-writable 
category index. However, I would still consider the fact that Arrow serializes 
this index as non-writable, if not a bug, an unnecessary limitation. Not 
everybody may be able to update their production code to pandas 0.24.2, so 
maybe for compatibility reasons this could still be "fixed"?

If not, I'll leave the following workaround here for reference, which I use 
whenever I load an Arrow serialized DataFrame in Python:
{code:python}
import numpy as np


def fix_arrow_categoricals(df):
    """A roundtrip of categoricals through pd->arr->pd can make categories
    non-writeable, which may make other parts of pandas blow up later on.

    Modifies df in place.
    """
    cats = df.select_dtypes('category').columns
    for col in cats:
        # Copying resets the array's flags, so the new categories are writeable
        df[col].cat.categories = np.copy(df[col].cat.categories)
{code}
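
For anyone who would rather not assign to .cat.categories directly (as far as I know, newer pandas releases no longer accept that assignment), here is a rough sketch of the same idea that returns a new DataFrame instead. The name copy_categories is made up for this example, and I have only reasoned through it for plain integer and string categories, so treat it as a starting point rather than a tested fix:

```python
import numpy as np
import pandas as pd


def copy_categories(df):
    """Return a copy of df whose categorical columns use freshly copied categories.

    Hypothetical helper: same idea as the workaround above, but it goes through
    rename_categories() rather than assigning to .cat.categories.
    """
    out = df.copy()
    for col in out.select_dtypes("category").columns:
        old = out[col].cat.categories
        # np.array(..., copy=True) allocates a new buffer that does not
        # inherit the read-only flag of the original categories.
        fresh = pd.Index(np.array(old, copy=True), name=old.name)
        # The labels are unchanged, so this only swaps the backing array;
        # codes and the ordered flag stay as they were.
        out[col] = out[col].cat.rename_categories(fresh)
    return out
```

Usage would be along the lines of df2 = copy_categories(tbl.to_pandas()), after which df2.c1.cat.reorder_categories([3,2,1]) should no longer hit the read-only buffer.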

> [Python] Pandas-Arrow roundtrip makes pd categorical index not writeable
> ------------------------------------------------------------------------
>
>                 Key: ARROW-3801
>                 URL: https://issues.apache.org/jira/browse/ARROW-3801
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: C++, Python
>    Affects Versions: 0.10.0
>            Reporter: Thomas Buhrmann
>            Priority: Major
>             Fix For: 0.14.0
>
>
> Serializing and deserializing a pandas series with categorical dtype will 
> make the categorical index non-writeable, which in turn trips up pandas when 
> e.g. reordering the categories, raising "ValueError: buffer source array is 
> read-only":
> {code}
> import pandas as pd
> import pyarrow as pa
> df = pd.Series([1,2,3], dtype='category', name="c1").to_frame()
> print("DType before:", repr(df.c1.dtype))
> print("Writeable:", df.c1.cat.categories.values.flags.writeable)
> ro = df.c1.cat.reorder_categories([3,2,1])
> print("DType reordered:", repr(ro.dtype), "\n")
> tbl = pa.Table.from_pandas(df)
> df2 = tbl.to_pandas()
> print("DType after:", repr(df2.c1.dtype))
> print("Writeable:", df2.c1.cat.categories.values.flags.writeable)
> ro = df2.c1.cat.reorder_categories([3,2,1])
> print("DType reordered:", repr(ro.dtype), "\n")
> {code}
>  
> Outputs:
>  
> {code:java}
> DType before: CategoricalDtype(categories=[1, 2, 3], ordered=False)
> Writeable: True
> DType reordered: CategoricalDtype(categories=[3, 2, 1], ordered=False)
> DType after: CategoricalDtype(categories=[1, 2, 3], ordered=False)
> Writeable: False
> ---------------------------------------------------------------------------
> ValueError Traceback (most recent call last)
> <ipython-input-365-85b439586c1a> in <module>
>  12 print("DType after:", repr(df2.c1.dtype))
>  13 print("Writeable:", df2.c1.cat.categories.values.flags.writeable)
> ---> 14 ro = df2.c1.cat.reorder_categories([3,2,1])
>  15 print("DType reordered:", repr(ro.dtype), "\n")
> {code}
>  
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)