jonkeane commented on pull request #9615:
URL: https://github.com/apache/arrow/pull/9615#issuecomment-834440235
Oh, knowing about missing values is helpful, lemme dig more into that and
see if I can replicate performance differences on those.
Here's `summary()` of the fanniemae dataset. There are a decent chunk of
NAs in it (the output shows the column types too):
```
> summary(df_fannie)
f0 f1 f2 f3
Min. :100001420754 Length:22180168 Length:22180168 Min. :1.750
1st Qu.:326084086722 Class :character Class :character 1st Qu.:3.375
Median :550659611473 Mode :character Mode :character Median :3.500
Mean :550440259075 Mean :3.561
3rd Qu.:775451076920 3rd Qu.:3.750
Max. :999999800242 Max. :5.900
f4 f5 f6 f7
Min. : 0 Min. :-1.00 Min. : 27.0 Min. : 0.0
1st Qu.: 136660 1st Qu.: 8.00 1st Qu.:220.0 1st Qu.:214.0
Median : 206558 Median :16.00 Median :336.0 Median :333.0
Mean : 226014 Mean :16.65 Mean :292.9 Mean :287.5
3rd Qu.: 299858 3rd Qu.:25.00 3rd Qu.:348.0 3rd Qu.:347.0
Max. :1203000 Max. :58.00 Max. :482.0 Max. :480.0
NA's :4058158 NA's :28840
f8 f9 f10 f11
Length:22180168 Min. : 0 Length:22180168 Length:22180168
Class :character 1st Qu.:17460 Class :character Class :character
Mode :character Median :31080 Mode :character Mode :character
Mean :28225
3rd Qu.:39580
Max. :49740
f12 f13 f14 f15
Min. : 1 Length:22180168 Length:22180168 Length:22180168
1st Qu.: 1 Class :character Class :character Class :character
Median : 1 Mode :character Mode :character Mode :character
Mean : 1
3rd Qu.: 1
Max. :16
NA's :22061889
f16 f17 f18 f19
Length:22180168 Min. : 3 Min. : 187 Min. : 65
Class :character 1st Qu.: 2945 1st Qu.: 730 1st Qu.:1319
Mode :character Median : 4658 Median : 3679 Median :2500
Mean : 5143 Mean : 6808 Mean :2561
3rd Qu.: 7026 3rd Qu.: 7542 3rd Qu.:2605
Max. :23055 Max. :55625 Max. :9900
NA's :22180014 NA's :22180087 NA's :22180124
f20 f21 f22 f23
Min. :-3561 Min. : 87 Min. : 4911 Min. : 284
1st Qu.: -34 1st Qu.: 1089 1st Qu.: 73622 1st Qu.: 12374
Median : 869 Median : 2214 Median :127763 Median : 20665
Mean : 1345 Mean : 3399 Mean :147056 Mean : 36262
3rd Qu.: 1877 3rd Qu.: 3980 3rd Qu.:198586 3rd Qu.: 40433
Max. :36497 Max. :24840 Max. :465825 Max. :539401
NA's :22180041 NA's :22180053 NA's :22180023 NA's :22180081
f24 f25 f26 f27
Min. :126773 Min. : 0 Min. : 0 Mode:logical
1st Qu.:126773 1st Qu.: 110 1st Qu.: 0 NA's:22180168
Median :126773 Median : 500 Median : 0
Mean :126773 Mean : 14636 Mean : 2807
3rd Qu.:126773 3rd Qu.: 2846 3rd Qu.: 0
Max. :126773 Max. :328871 Max. :129946
NA's :22180167 NA's :22180095 NA's :22151328
f28 f29 f30
Length:22180168 Mode:logical Length:22180168
Class :character NA's:22180168 Class :character
Mode :character Mode :character
```
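For a more compact view of the missingness than `summary()` gives, something like the following could work. This is only a sketch on a toy data frame: `df` and its values are made up for illustration and stand in for `df_fannie`.

```r
# Toy stand-in for df_fannie (hypothetical values, not the real data)
df <- data.frame(
  f0 = c(1, 2, 3, 4),
  f1 = c("a", NA, "c", "d"),
  f2 = c(NA, NA, NA, 1.5),
  f3 = NA  # recycled to an all-NA logical column, like f27 and f29 above
)

# Share of missing values per column, largest first
na_share <- sort(colMeans(is.na(df)), decreasing = TRUE)
print(round(na_share, 3))
```

Running the same two lines on the real data frame should make the nearly-all-NA columns stand out at a glance.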
I have also been digging into differences across types. Factors seem to
parallelize really well. For comparison, I tried converting the chitraffic
data frame, which is a mix of strings + numerics + 2 factor columns. When I do
that (with 12 CPU cores available), the most I’m seeing the CPU get to is
~140%, and even that is only briefly; most of the time the process is at 100%:
```
> system.time(tab_chi_traffic <- arrow::Table$create(df_chi_traffic))
user system elapsed
29.093 0.797 28.002
```
I then created a silly version of this dataset where I converted each of the
columns into a factor (totally naively, with `as.factor()`). Converting that
takes a bit over half the elapsed time, and the CPU usage peaks at ~300%,
though it drops down to 100% and then bumps back up a few times:
```
> system.time(tab_chi_traffic <- arrow::Table$create(df_chi_traffic_factors))
user system elapsed
31.073 1.194 15.857
```
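For reference, the naive all-factor conversion was along these lines. This is a sketch on a toy data frame rather than the exact code I ran: `df` and its columns are hypothetical stand-ins for `df_chi_traffic`.

```r
# Toy stand-in for df_chi_traffic (hypothetical columns and values)
df <- data.frame(
  id     = 1:4,
  street = c("A", "B", "A", "C"),
  speed  = c(12.5, 30.1, NA, 22.0),
  stringsAsFactors = FALSE
)

# Naively convert every column, numeric ones included, to a factor
df_factors <- as.data.frame(lapply(df, as.factor))
str(df_factors)

# Time the conversion to an Arrow Table, only if arrow is installed
if (requireNamespace("arrow", quietly = TRUE)) {
  print(system.time(tab <- arrow::Table$create(df_factors)))
}
```

On a data frame this small the timing is meaningless; it is only there to show the shape of the comparison.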