[
https://issues.apache.org/jira/browse/ARROW-9637?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17170593#comment-17170593
]
Joris Van den Bossche commented on ARROW-9637:
----------------------------------------------
Based on your description, it might be an issue with the groupby calculation in
pandas, rather than with the reading of Parquet files with categoricals (which
is what pyarrow is used for in your case). In that case it's something to
report to pandas (https://github.com/pandas-dev/pandas/issues).
But to be sure, you will need to post a reproducible example.
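For reference, a minimal sketch of what such a reproducible example could look like (the column names, sizes, and the `observed` comparison are illustrative assumptions, not taken from the original report, which round-trips the data through Parquet via pyarrow):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 100_000      # smaller than the reported 1 million rows, enough to illustrate
cardinality = 100

# One date column, one categorical column, ten float columns to sum.
df = pd.DataFrame({
    "date": pd.to_datetime("2020-01-01")
            + pd.to_timedelta(rng.integers(0, 30, n), unit="D"),
    "cat": pd.Categorical(rng.integers(0, cardinality, n).astype(str)),
    **{f"f{i}": rng.random(n) for i in range(10)},
})
# The original report writes this frame with df.to_parquet(...) and reads it
# back with pd.read_parquet(...), which uses pyarrow under the hood.

# Grouping by a categorical: by default pandas materializes every category
# combination (observed=False), which grows with the category's cardinality.
full = df.groupby(["date", "cat"], observed=False).sum()

# Restricting the result to combinations actually present in the data:
observed = df.groupby(["date", "cat"], observed=True).sum()
```

Whether `observed=False` is the cause here would need the actual code, but comparing the two settings is one quick way to narrow the problem down.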
> [Python] Speed degradation with categoricals
> --------------------------------------------
>
> Key: ARROW-9637
> URL: https://issues.apache.org/jira/browse/ARROW-9637
> Project: Apache Arrow
> Issue Type: Bug
> Affects Versions: 1.0.0
> Reporter: Larry Parker
> Priority: Major
>
> I have noticed some major speed degradation when using categorical data
> types. For example, take a Parquet file with 1 million rows, summing 10 float
> columns and grouping by two columns (one a date column and one a category
> column). The cardinality of the category column seems to have a major effect:
> when grouping on a category column of cardinality 10, performance is decent
> (the query runs in 150 ms), but with a cardinality of 100, it takes 10 seconds.
> If I switch over to my Parquet file that does *not* have categorical columns,
> the same query that took 10 seconds with categoricals now runs in 350 ms.
> I would be happy to post the Pandas code that I'm using (including how I'm
> creating the Parquet file), but I first wanted to report this and see if it's
> a known issue.
> Thanks.
--
This message was sent by Atlassian Jira
(v8.3.4#803005)