[ https://issues.apache.org/jira/browse/ARROW-9637?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17170879#comment-17170879 ]

Joris Van den Bossche commented on ARROW-9637:
----------------------------------------------

Thanks for opening an issue on the pandas side! Since your timing covers only 
the groupby operation (which happens entirely in pandas) and does not include 
reading the parquet file, I am going to close the issue here, but will follow 
up further on the pandas issue.
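To make the distinction concrete, here is a minimal sketch of the kind of benchmark being discussed, timing only the groupby/sum (no parquet I/O). All column names, the 1-million-row size, and the cardinality of 100 are assumptions reconstructed from the issue description, not the reporter's actual code.

```python
import time

import numpy as np
import pandas as pd

# Hypothetical reproduction: a date column, one categorical column of
# cardinality 100, and 10 float columns, as described in the report.
n = 1_000_000
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "date": pd.Timestamp("2020-01-01")
            + pd.to_timedelta(rng.integers(0, 365, n), unit="D"),
    "cat": pd.Categorical.from_codes(
        rng.integers(0, 100, n),
        categories=[f"c{i}" for i in range(100)],
    ),
    **{f"x{i}": rng.random(n) for i in range(10)},
})

# Time only the groupby operation -- this is the part that runs
# entirely in pandas, independent of how the file was read.
t0 = time.perf_counter()
result = df.groupby(["date", "cat"], observed=False).sum()
elapsed = time.perf_counter() - t0
print(f"{len(result)} groups in {elapsed:.3f} s")
```

Note that with categorical groupers pandas by default (`observed=False`) materializes all category combinations, which is one known reason grouping cost grows with category cardinality; passing `observed=True` restricts the result to combinations that actually occur.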

> [Python] Speed degradation with categoricals
> --------------------------------------------
>
>                 Key: ARROW-9637
>                 URL: https://issues.apache.org/jira/browse/ARROW-9637
>             Project: Apache Arrow
>          Issue Type: Bug
>    Affects Versions: 1.0.0
>            Reporter: Larry Parker
>            Priority: Major
>         Attachments: fact1__c.parquet.zip
>
>
> I have noticed major speed degradation when using categorical data types.  
> For example, I have a Parquet file with 1 million rows, and I sum 10 float 
> columns grouped by two columns (one a date column and one a category 
> column).  The cardinality of the category column has a major effect: when 
> grouping on a category column of cardinality 10, performance is decent (the 
> query runs in 150 ms), but with a cardinality of 100, the query takes 10 
> seconds.  If I switch over to a version of the Parquet file that does *not* 
> have categorical columns, the same query that took 10 seconds with 
> categoricals runs in 350 ms.
> I would be happy to post the pandas code I'm using (including how I'm 
> creating the Parquet file), but I first wanted to report this and see 
> whether it's a known issue.
> Thanks.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
