George Sakkis created ARROW-4492:
------------------------------------
Summary: ValueError: Categorical categories must be unique
Key: ARROW-4492
URL: https://issues.apache.org/jira/browse/ARROW-4492
Project: Apache Arrow
Issue Type: Bug
Components: Python
Affects Versions: 0.12.0
Reporter: George Sakkis
Attachments: slug.pq
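On pyarrow 0.12.0, some (but not all) columns cannot be read as category dtype; the read fails with "ValueError: Categorical categories must be unique".
Attached is an extracted failing sample (slug.pq) that reproduces the error: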
{noformat}
import dask.dataframe as dd
df = dd.read_parquet('slug.pq', categories=['slug'], engine='pyarrow').compute()
print(len(df['slug'].dtype.categories))
{noformat}
This works on pyarrow 0.11.1 (and fastparquet 0.2.1).
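
For context, a minimal sketch of where this message comes from on the pandas side, plus a possible interim workaround. It is an unconfirmed assumption that the categories pyarrow 0.12.0 hands to pandas for the failing column contain repeated values; the attached file has not been inspected here. The helper name to_category_after_read and the small stand-in DataFrame are hypothetical and only illustrate casting to category after a plain read.

```python
import pandas as pd


def to_category_after_read(df, columns):
    """Hypothetical workaround: cast already-loaded columns to category in pandas."""
    out = df.copy()
    for col in columns:
        # astype("category") derives the categories from the distinct values,
        # so they are unique by construction.
        out[col] = out[col].astype("category")
    return out


# pandas rejects duplicate categories with the message from the summary.
message = None
try:
    pd.Categorical.from_codes([0, 1], categories=["a", "a"])
except ValueError as exc:
    message = str(exc)

# Stand-in for the 'slug' column as it would look when read without categories=[...].
plain = pd.DataFrame({"slug": ["x", "y", "x"]})
converted = to_category_after_read(plain, ["slug"])
print(message)
print(len(converted["slug"].dtype.categories))
```

If the assumption holds, reading the column without categories=['slug'] and casting afterwards should avoid the error, at the cost of materializing the strings first.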