[ https://issues.apache.org/jira/browse/ARROW-9637?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17170789#comment-17170789 ]

Larry Parker edited comment on ARROW-9637 at 8/4/20, 1:19 PM:
--------------------------------------------------------------

```
import datetime as dt

import pandas as pd
import pyarrow as pa

pd.options.display.float_format = '{:,.4f}'.format

print("Pandas v" + pd.__version__)
print("PyArrow v" + pa.__version__)

folder = "/users/lparker/data/pandas_perf/"

def load(table_name):
    file_name = folder + table_name + ".parquet"
    df = pd.read_parquet(file_name)
    return df

df__fact1 = load("fact1__nc")      # non-categorical file
# df__fact1 = load("fact1__c")     # categorical file

print("\nRow count = {}".format(len(df__fact1)))
print(df__fact1.dtypes)

dim_col = "a2"
print("\nCardinality = {}".format(df__fact1[dim_col].nunique()))

# Time the aggregation: sum the 10 measure columns, grouped by date and a2.
ts0 = dt.datetime.now()

df = df__fact1.groupby(["date", dim_col]).agg(
    m1=("m1", "sum"), m2=("m2", "sum"), m3=("m3", "sum"), m4=("m4", "sum"),
    m5=("m5", "sum"), m6=("m6", "sum"), m7=("m7", "sum"), m8=("m8", "sum"),
    m9=("m9", "sum"), m10=("m10", "sum"))

ts1 = dt.datetime.now()
print("Query time (ms) = " + str(int((ts1 - ts0).total_seconds() * 1000)))

print("\nRow count = {}".format(len(df)))
print(df.head(10))
```
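
The code that generates the two files referenced above (fact1__nc, fact1__c) is not posted in this comment, so here is a minimal sketch, assuming random data, of how they might be produced. The column names (date, a2, m1..m10) and file names come from the script above; the row count and cardinality figures come from the issue description below; the random content is purely illustrative and not the reporter's actual generation code.

```
# Hypothetical generator for the two test files (an assumption, not the
# reporter's actual code): 1M rows, 10 float measure columns, a date column,
# and a dimension column "a2" of configurable cardinality.
import numpy as np
import pandas as pd

folder = "/users/lparker/data/pandas_perf/"
n_rows, cardinality = 1_000_000, 100   # figures from the issue description

rng = np.random.default_rng(0)
df = pd.DataFrame({"m{}".format(i): rng.random(n_rows) for i in range(1, 11)})
df["date"] = pd.Timestamp("2020-01-01") + pd.to_timedelta(
    rng.integers(0, 365, n_rows), unit="D")
df["a2"] = rng.integers(0, cardinality, n_rows).astype(str)

df.to_parquet(folder + "fact1__nc.parquet")     # "a2" stored as plain strings
df["a2"] = df["a2"].astype("category")
df.to_parquet(folder + "fact1__c.parquet")      # "a2" stored as a categorical
```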



> [Python] Speed degradation with categoricals
> --------------------------------------------
>
>                 Key: ARROW-9637
>                 URL: https://issues.apache.org/jira/browse/ARROW-9637
>             Project: Apache Arrow
>          Issue Type: Bug
>    Affects Versions: 1.0.0
>            Reporter: Larry Parker
>            Priority: Major
>
> I have noticed some major speed degradation when using categorical data 
> types.  For example, consider a Parquet file with 1 million rows, queried by 
> summing 10 float columns and grouping by two columns (one a date column and 
> one a category column).  The cardinality of the category column seems to have 
> a major effect.  When grouping on a category column of cardinality 10, 
> performance is decent (the query runs in 150 ms).  But with a cardinality of 
> 100, the same query runs in 10 seconds.  
> If I switch over to my Parquet file that does *not* have categorical columns, 
> the same query that took 10 seconds with categoricals now runs in 350 ms.
> I would be happy to post the Pandas code that I'm using (including how I'm 
> creating the Parquet file), but I first wanted to report this and see if it's 
> a known issue.
> Thanks.
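
Not part of the original report, but one diagnostic worth sketching: pandas' groupby handles categorical keys specially (by default the result includes unobserved categories unless observed=True is passed), so timing the same aggregation with and without the categorical dtype can isolate whether the dtype itself is responsible. The sketch below reuses df__fact1 and dim_col from the script above and assumes the categorical file (fact1__c) was loaded; the helper name timed_groupby is hypothetical.

```
# A hypothetical diagnostic (an assumption, not from the report): time the
# same aggregation three ways to isolate the effect of the categorical dtype.
import datetime as dt

def timed_groupby(frame, keys, **gb_kwargs):
    """Run the 10-column sum aggregation and return elapsed milliseconds."""
    spec = {"m{}".format(i): ("m{}".format(i), "sum") for i in range(1, 11)}
    ts0 = dt.datetime.now()
    frame.groupby(keys, **gb_kwargs).agg(**spec)
    return int((dt.datetime.now() - ts0).total_seconds() * 1000)

print(timed_groupby(df__fact1, ["date", dim_col]))                 # as reported
print(timed_groupby(df__fact1, ["date", dim_col], observed=True))  # skip unobserved categories
df_obj = df__fact1.copy()
df_obj[dim_col] = df_obj[dim_col].astype(object)                   # drop the categorical dtype
print(timed_groupby(df_obj, ["date", dim_col]))
```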


