[
https://issues.apache.org/jira/browse/ARROW-11157?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17260751#comment-17260751
]
Wes McKinney commented on ARROW-11157:
--------------------------------------
This looks buggy to me. Less consideration has been given to non-string
categorical data since it appears less frequently in practice.
cc [~jorisvandenbossche]
> [Python] Consistent handling of categoricals
> --------------------------------------------
>
> Key: ARROW-11157
> URL: https://issues.apache.org/jira/browse/ARROW-11157
> Project: Apache Arrow
> Issue Type: Improvement
> Components: Python
> Affects Versions: 2.0.0
> Reporter: Chris Roat
> Priority: Minor
>
> What is the current state of categoricals with pyarrow? The `categories`
> parameter mentioned in [this GitHub
> issue|https://github.com/apache/arrow/issues/1688] does not seem to be
> accepted by `pd.read_parquet` anymore. Round-tripping `int` categoricals
> through pyarrow does not preserve the dtype, while `str` categoricals do
> round-trip, unless the file was written by fastparquet.
> Using pandas 1.1.5, pyarrow 2.0.0, and fastparquet 0.4.1, I see the following
> handling of categoricals:
>
> {code:python}
> import os
> import pandas as pd
> fname = '/tmp/tst'
> data = {
>     'int': pd.Series([0, 1] * 1000, dtype=pd.CategoricalDtype([0,1])),
>     'str': pd.Series(['foo', 'bar'] * 1000, dtype=pd.CategoricalDtype(['foo', 'bar'])),
> }
> df = pd.DataFrame(data)
> for write in ['fastparquet', 'pyarrow']:
>     for read in ['fastparquet', 'pyarrow']:
>         if os.path.exists(fname):
>             os.remove(fname)
>         df.to_parquet(fname, engine=write, compression=None)
>         df_read = pd.read_parquet(fname, engine=read)
>         print()
>         print('write:', write, 'read:', read)
>         for t in data.keys():
>             print(t, df[t].dtype == df_read[t].dtype)
> {code}
>
>
> {noformat}
> write: fastparquet read: fastparquet
> int True
> str True
> write: fastparquet read: pyarrow
> int False
> str False
> write: pyarrow read: fastparquet
> int True
> str True
> write: pyarrow read: pyarrow
> int False
> str True
> {noformat}
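
As a possible stopgap while the `int` case is unresolved, the categorical dtypes can be re-applied after reading. The sketch below is not taken from the issue or from pyarrow: `restore_categoricals` is a hypothetical helper, and the dtype loss is simulated with `astype` rather than an actual Parquet round trip, so it assumes the reader still has access to a frame (or schema) carrying the intended dtypes.

```python
import pandas as pd


def restore_categoricals(df_read, df_ref):
    """Cast columns of df_read back to the categorical dtypes of df_ref.

    Hypothetical helper for illustration; not part of pandas or pyarrow.
    """
    out = df_read.copy()
    for name, dtype in df_ref.dtypes.items():
        # Only touch columns that were categorical in the reference frame.
        if isinstance(dtype, pd.CategoricalDtype) and name in out.columns:
            out[name] = out[name].astype(dtype)
    return out


# Reference frame with the same columns as the report (shortened).
df = pd.DataFrame({
    'int': pd.Series([0, 1] * 3, dtype=pd.CategoricalDtype([0, 1])),
    'str': pd.Series(['foo', 'bar'] * 3,
                     dtype=pd.CategoricalDtype(['foo', 'bar'])),
})

# Stand-in for a frame whose categorical dtypes were lost on read
# (simulated here; no Parquet file is written or read).
df_read = df.astype({'int': 'int64', 'str': 'object'})

df_fixed = restore_categoricals(df_read, df)
for name in df.columns:
    print(name, df[name].dtype == df_fixed[name].dtype)
```

This only papers over the symptom on the pandas side; values that are not among the reference categories would become missing after the cast, so it is not a substitute for consistent handling in the readers themselves.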
--
This message was sent by Atlassian Jira
(v8.3.4#803005)