chrisroat commented on issue #1688:
URL: https://github.com/apache/arrow/issues/1688#issuecomment-751128963
What is the current state of categoricals with pyarrow? The `categories`
parameter mentioned above no longer seems to be accepted by `pd.read_parquet`.
With pyarrow, round-tripping `int` categoricals does not preserve the dtype,
though `str` categoricals do, except when the file was written by fastparquet.
Using pandas 1.1.5, pyarrow 2.0.0, and fastparquet 0.4.1, I see the
following handling of categoricals (`True` means the dtype read back matches
the original):
```
import os
import pandas as pd

fname = '/tmp/tst'
data = {
    'int': pd.Series([0, 1] * 1000, dtype=pd.CategoricalDtype([0, 1])),
    'str': pd.Series(['foo', 'bar'] * 1000,
                     dtype=pd.CategoricalDtype(['foo', 'bar'])),
}
df = pd.DataFrame(data)

for write in ['fastparquet', 'pyarrow']:
    for read in ['fastparquet', 'pyarrow']:
        if os.path.exists(fname):
            os.remove(fname)
        df.to_parquet(fname, engine=write, compression=None)
        df_read = pd.read_parquet(fname, engine=read)
        print()
        print('write:', write, 'read:', read)
        for t in data.keys():
            print(t, df[t].dtype == df_read[t].dtype)
```
```
write: fastparquet read: fastparquet
int True
str True
write: fastparquet read: pyarrow
int False
str False
write: pyarrow read: fastparquet
int True
str True
write: pyarrow read: pyarrow
int False
str True
```
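Until the dtypes round-trip on their own, one possible workaround is to cast the columns back after reading. The sketch below is only an illustration: `restore_categoricals` is a hypothetical helper (not part of pandas or pyarrow), and it assumes the original categorical dtypes are still available from a reference frame. The plain frame stands in for whatever `pd.read_parquet` returned.

```python
import pandas as pd


def restore_categoricals(df_read, reference):
    """Cast columns of df_read back to the categorical dtypes of reference.

    Hypothetical helper for illustration; not a pandas or pyarrow API.
    """
    # Keep only columns that are categorical in the reference frame
    # and that also exist in the frame that was read back.
    dtypes = {
        name: dtype
        for name, dtype in reference.dtypes.items()
        if isinstance(dtype, pd.CategoricalDtype) and name in df_read.columns
    }
    # astype returns a new frame; values outside the categories become NaN.
    return df_read.astype(dtypes)


# Reference frame with categorical columns, shaped like the one in the script.
df = pd.DataFrame({
    'int': pd.Series([0, 1] * 3, dtype=pd.CategoricalDtype([0, 1])),
    'str': pd.Series(['foo', 'bar'] * 3,
                     dtype=pd.CategoricalDtype(['foo', 'bar'])),
})

# Stand-in for a frame that came back from Parquet without its categoricals.
df_plain = pd.DataFrame({'int': [0, 1] * 3, 'str': ['foo', 'bar'] * 3})

df_fixed = restore_categoricals(df_plain, df)
for t in df.columns:
    print(t, df[t].dtype == df_fixed[t].dtype)
```

This only papers over the problem on the reading side; it does not explain why the engines disagree.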