[ 
https://issues.apache.org/jira/browse/ARROW-11157?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17260751#comment-17260751
 ] 

Wes McKinney commented on ARROW-11157:
--------------------------------------

This looks buggy to me. Less consideration has been given to non-string 
categorical data since it appears less frequently in practice. 

cc [~jorisvandenbossche]

> [Python] Consistent handling of categoricals
> --------------------------------------------
>
>                 Key: ARROW-11157
>                 URL: https://issues.apache.org/jira/browse/ARROW-11157
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: Python
>    Affects Versions: 2.0.0
>            Reporter: Chris Roat
>            Priority: Minor
>
> What is the current state of categoricals with pyarrow? The `categories` 
> parameter mentioned in [this GitHub 
> issue|https://github.com/apache/arrow/issues/1688] no longer seems to be 
> accepted by `pd.read_parquet`. Round-tripping `int` categoricals through 
> pyarrow does not preserve the dtype, though `str` categoricals do survive, 
> unless the file was written by fastparquet.
> Using pandas 1.1.5, pyarrow 2.0.0, and fastparquet 0.4.1, I see the following 
> handling of categoricals:
>  
> {code:python}
> import os
> import pandas as pd
>
> fname = '/tmp/tst'
> # One integer-valued and one string-valued categorical column.
> data = {
>     'int': pd.Series([0, 1] * 1000, dtype=pd.CategoricalDtype([0, 1])),
>     'str': pd.Series(['foo', 'bar'] * 1000, dtype=pd.CategoricalDtype(['foo', 'bar'])),
> }
> df = pd.DataFrame(data)
> # Try every writer/reader engine combination.
> for write in ['fastparquet', 'pyarrow']:
>     for read in ['fastparquet', 'pyarrow']:
>         if os.path.exists(fname):
>             os.remove(fname)
>         df.to_parquet(fname, engine=write, compression=None)
>         df_read = pd.read_parquet(fname, engine=read)
>         print()
>         print('write:', write, 'read:', read)
>         # True means the categorical dtype survived the round trip.
>         for t in data.keys():
>             print(t, df[t].dtype == df_read[t].dtype)
> {code}
>  
>  
> {noformat}
> write: fastparquet read: fastparquet
> int True
> str True
> write: fastparquet read: pyarrow
> int False
> str False
> write: pyarrow read: fastparquet
> int True
> str True
> write: pyarrow read: pyarrow
> int False
> str True
> {noformat}
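
The True/False output above shows that a dtype changed but not what it changed to. A minimal sketch of a more informative check, using only pandas: `describe_dtype_mismatch` is a hypothetical helper name, and the round-tripped frame is simulated here (the `int` column is cast to `int64` by hand), since the report does not state which dtype pyarrow actually returned.

```python
import pandas as pd


def describe_dtype_mismatch(original: pd.DataFrame, roundtrip: pd.DataFrame) -> dict:
    """Map each column whose dtype changed to (original dtype, new dtype)."""
    mismatches = {}
    for name in original.columns:
        before, after = original[name].dtype, roundtrip[name].dtype
        if before != after:
            mismatches[name] = (str(before), str(after))
    return mismatches


df = pd.DataFrame({
    'int': pd.Series([0, 1] * 3, dtype=pd.CategoricalDtype([0, 1])),
    'str': pd.Series(['foo', 'bar'] * 3, dtype=pd.CategoricalDtype(['foo', 'bar'])),
})

# Stand-in for pd.read_parquet(...): ASSUMES the 'int' categorical came back
# as plain int64. This is a simulation, not observed pyarrow behavior.
df_read = df.copy()
df_read['int'] = df_read['int'].astype('int64')

print(describe_dtype_mismatch(df, df_read))
```

In the reproduction script, the same helper could be called on `df` and the real `df_read` in place of the inner `for t in data.keys()` loop.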



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
