[jira] [Closed] (ARROW-9637) [Python] Speed degradation with categoricals

Joris Van den Bossche (Jira) Tue, 04 Aug 2020 08:31:48 -0700


     [ https://issues.apache.org/jira/browse/ARROW-9637?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Joris Van den Bossche closed ARROW-9637.
----------------------------------------
    Resolution: Not A Problem

> [Python] Speed degradation with categoricals
> --------------------------------------------
>
>                 Key: ARROW-9637
>                 URL: https://issues.apache.org/jira/browse/ARROW-9637
>             Project: Apache Arrow
>          Issue Type: Bug
>    Affects Versions: 1.0.0
>            Reporter: Larry Parker
>            Priority: Major
>         Attachments: fact1__c.parquet.zip
>
>
> I have noticed a major slowdown when using categorical data types.  For 
> example, I have a Parquet file with 1 million rows and a query that sums 10 
> float columns and groups by two columns (a date column and a category 
> column).  The cardinality of the category column seems to have a major 
> effect.  When grouping on a category column of cardinality 10, performance 
> is decent (the query runs in 150 ms), but with a cardinality of 100 the 
> query takes 10 seconds.  If I switch to my Parquet file that does *not* have 
> categorical columns, the same query that took 10 seconds with categoricals 
> now runs in 350 ms.
> I would be happy to post the pandas code that I'm using (including how I'm 
> creating the Parquet file), but I first wanted to report this and see if it's 
> a known issue.
> Thanks.
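
The reporter's code was not posted, so the following is only a minimal sketch of the kind of query described above: summing 10 float columns while grouping by a date column and a categorical column. The column names, the helper functions `make_frame` and `grouped_sum`, and the in-memory construction (no Parquet round trip) are all assumptions made for illustration. The sketch also exposes the `observed` option of pandas `groupby`, which controls whether unobserved category values appear in the result; whether that option accounts for the timings reported here is not stated in this message.

```python
import numpy as np
import pandas as pd


def make_frame(n_rows: int, cardinality: int, n_dates: int = 365, seed: int = 0) -> pd.DataFrame:
    """Build a hypothetical frame: 10 float columns, a date column, a categorical column."""
    rng = np.random.default_rng(seed)
    frame = pd.DataFrame({f"value_{i}": rng.random(n_rows) for i in range(10)})

    # Date column drawn at random from n_dates consecutive days.
    dates = pd.date_range("2020-01-01", periods=n_dates, freq="D").to_numpy()
    frame["date"] = dates[rng.integers(0, n_dates, n_rows)]

    # Categorical column with the requested number of distinct labels.
    labels = [f"cat_{i}" for i in range(cardinality)]
    codes = rng.integers(0, cardinality, n_rows)
    frame["category"] = pd.Categorical.from_codes(codes, categories=labels)
    return frame


def grouped_sum(frame: pd.DataFrame, observed: bool) -> pd.DataFrame:
    """Sum the float columns, grouped by date and category."""
    value_cols = [c for c in frame.columns if c.startswith("value_")]
    return frame.groupby(["date", "category"], observed=observed)[value_cols].sum()
```

To compare the two settings, one could call `grouped_sum(make_frame(1_000_000, 100), observed=True)` and the same with `observed=False`, wrapping each call in `time.perf_counter()`. Timings will depend on the pandas version and the data, so none are quoted here.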



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
