[
https://issues.apache.org/jira/browse/ARROW-9637?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17170791#comment-17170791
]
Larry Parker edited comment on ARROW-9637 at 8/4/20, 1:27 PM:
--------------------------------------------------------------
The file _fact1____c.parquet.zip_ contains the categorical Parquet file. I
could not upload _fact1____nc.parquet.zip_ as it exceeded the 60 MB upload
limit (by 2 MB). That file contains the non-categorical Parquet file, but you
should be able to convert the uploaded file so that it does not use category
columns, and save it to _fact1____nc.parquet_.
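
For anyone who wants to do that conversion, here is a minimal sketch rather
than the exact code used to build the original files. The function names and
the src/dst paths are hypothetical placeholders, and it assumes the category
columns have no missing values that would block a cast back to an integer
dtype.

```python
import pandas as pd


def decategorize(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with category columns cast to their categories' dtype."""
    out = df.copy()
    for name in out.select_dtypes(include="category").columns:
        # Cast each categorical column back to the dtype of its categories
        # (e.g. object/str for string categories).
        out[name] = out[name].astype(out[name].cat.categories.dtype)
    return out


def convert_file(src: str, dst: str) -> None:
    """Read a Parquet file, drop the categorical dtypes, and write it back out.

    src and dst are placeholder paths; Parquet I/O needs pyarrow (or another
    engine supported by pandas) to be installed.
    """
    decategorize(pd.read_parquet(src)).to_parquet(dst)
```

Calling convert_file on the unzipped attachment, with an output path of your
choice, should give a non-categorical file to compare against.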
was (Author: lparker):
The file _fact1____c.parquet.zip_ contains the categorical Parquet file. I
could not upload _fact1__nc.parquet.zip_ as it exceeded the 60 MB upload limit
(by 2 MB). That file contains the non-categorical Parquet file, but you should
be able to convert the uploaded file so that it does not use category columns,
and save it to _fact1__nc.parquet_.
> [Python] Speed degradation with categoricals
> --------------------------------------------
>
> Key: ARROW-9637
> URL: https://issues.apache.org/jira/browse/ARROW-9637
> Project: Apache Arrow
> Issue Type: Bug
> Affects Versions: 1.0.0
> Reporter: Larry Parker
> Priority: Major
> Attachments: fact1__c.parquet.zip
>
>
> I have noticed a major speed degradation when using categorical data types.
> For example, I have a Parquet file with 1 million rows and a query that sums
> 10 float columns and groups by two columns (one a date column and one a
> category column). The cardinality of the category seems to have a major
> effect. When grouping on a category column of cardinality 10, performance is
> decent (the query runs in 150 ms). But with a cardinality of 100, the query
> runs in 10 seconds.
> If I switch over to my Parquet file that does *not* have categorical columns,
> the same query that took 10 seconds with categoricals now runs in 350 ms.
> I would be happy to post the Pandas code that I'm using (including how I'm
> creating the Parquet file), but I first wanted to report this and see if it's
> a known issue.
> Thanks.
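
The kind of query described above can be sketched as follows. This is not the
reporter's code, which was not posted. It builds synthetic data instead of
reading the attached file, and the column names (f0..f9, date, key), the
number of distinct dates, and the helper names are all assumptions made for
illustration. pandas' groupby accepts an `observed` flag that controls whether
unobserved combinations of categorical keys appear in the result. It is
exposed here only so that both settings can be timed, not as a diagnosis of
this issue.

```python
import time

import numpy as np
import pandas as pd


def make_frame(n_rows=1_000_000, cardinality=100, n_dates=365,
               categorical=True, seed=0):
    """Build a synthetic frame: 10 float columns, a date column, a key column."""
    rng = np.random.default_rng(seed)
    data = {f"f{i}": rng.random(n_rows) for i in range(10)}
    dates = pd.date_range("2020-01-01", periods=n_dates, freq="D")
    data["date"] = dates[rng.integers(0, n_dates, n_rows)]
    labels = np.array([f"cat_{i}" for i in range(cardinality)])
    key = labels[rng.integers(0, cardinality, n_rows)]
    # Same values either way; only the dtype of the key column differs.
    data["key"] = pd.Categorical(key) if categorical else key
    return pd.DataFrame(data)


def timed_groupby_sum(df, observed=True):
    """Group by (date, key), sum the float columns, return (result, seconds)."""
    start = time.perf_counter()
    result = df.groupby(["date", "key"], observed=observed).sum()
    return result, time.perf_counter() - start


if __name__ == "__main__":
    for categorical in (True, False):
        frame = make_frame(categorical=categorical)
        for observed in (True, False):
            _, seconds = timed_groupby_sum(frame, observed=observed)
            print(f"categorical={categorical} observed={observed}: {seconds:.3f} s")
```

Timings will depend on the pandas version and the machine, so the figures in
the report should not be expected to reproduce exactly.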
--
This message was sent by Atlassian Jira
(v8.3.4#803005)