[
https://issues.apache.org/jira/browse/ARROW-11157?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17271372#comment-17271372
]
Joris Van den Bossche commented on ARROW-11157:
-----------------------------------------------
Additional answer specifically about this:
bq. The `categories` parameter mentioned in this GitHub issue does not seem to
be accepted in `pd.read_parquet` anymore
It certainly still works in the {{to_pandas}} method:
{code}
In [89]: pa.table({'int': [1, 2]}).to_pandas().dtypes
Out[89]:
int int64
dtype: object
In [90]: pa.table({'int': [1, 2]}).to_pandas(categories=['int']).dtypes
Out[90]:
int category
dtype: object
{code}
It might be that this worked at some point directly in the {{pd.read_parquet}}
function, but it indeed does not at the moment. The main problem is that
additional keywords ({{kwargs}}) could be meant for either
{{parquet.read_table}} or {{Table.to_pandas}}, and right now they are all
passed to {{read_table}}. We should probably have a way to specify keywords
for {{to_pandas}} as well.
As a workaround, you can read with pyarrow and do the conversion to pandas
manually. So basically instead of {{pd.read_parquet(..)}} you can do
{{pyarrow.parquet.read_table(..).to_pandas(..)}}.
> [Python] Consistent handling of categoricals
> --------------------------------------------
>
> Key: ARROW-11157
> URL: https://issues.apache.org/jira/browse/ARROW-11157
> Project: Apache Arrow
> Issue Type: Improvement
> Components: Python
> Affects Versions: 2.0.0
> Reporter: Chris Roat
> Priority: Minor
>
> What is the current state of categoricals with pyarrow? The `categories`
> parameter mentioned [in this
> GitHub|https://github.com/apache/arrow/issues/1688] issue does not seem to be
> accepted in `pd.read_parquet` anymore. I see that read/write of `int`
> categoricals does not work, though `str` do -- except if the file is written
> by fastparquet.
> Using pandas 1.1.5, pyarrow 2.0.0, and fastparquet 0.4.1, I see the following
> handling of categoricals:
>
> {code:python}
> import os
> import pandas as pd
>
> fname = '/tmp/tst'
> data = {
>     'int': pd.Series([0, 1] * 1000, dtype=pd.CategoricalDtype([0, 1])),
>     'str': pd.Series(['foo', 'bar'] * 1000,
>                      dtype=pd.CategoricalDtype(['foo', 'bar'])),
> }
> df = pd.DataFrame(data)
> for write in ['fastparquet', 'pyarrow']:
>     for read in ['fastparquet', 'pyarrow']:
>         if os.path.exists(fname):
>             os.remove(fname)
>         df.to_parquet(fname, engine=write, compression=None)
>         df_read = pd.read_parquet(fname, engine=read)
>         print()
>         print('write:', write, 'read:', read)
>         for t in data.keys():
>             print(t, df[t].dtype == df_read[t].dtype)
> {code}
>
>
> {noformat}
> write: fastparquet read: fastparquet
> int True
> str True
> write: fastparquet read: pyarrow
> int False
> str False
> write: pyarrow read: fastparquet
> int True
> str True
> write: pyarrow read: pyarrow
> int False
> str True
> {noformat}