jonkeane commented on pull request #9615:
URL: https://github.com/apache/arrow/pull/9615#issuecomment-834795030
Here's another example: converting a data.frame of strings shows no
parallelization, but once those strings are converted to factors, boom, we get
parallelization:
```
> library(arrow)
Attaching package: ‘arrow’
The following object is masked from ‘package:utils’:
    timestamp
>
> # this sample is located at https://ursa-qa.s3.amazonaws.com/single_types/type_strings.parquet
> # it is 1M rows, 5 columns. The first column has no missing, the second has 10% missing,
> # the third 25% missing, the fourth 50% missing, and the 5th 90% missing.
> strings_df <- read_parquet("~/repos/ab_store/data/type_strings.parquet")
>
> # embiggen so that the transform differences are easier to see (and so we have more columns than cores)
> strings_df <- dplyr::bind_cols(strings_df, strings_df, strings_df)
New names:
* jane -> jane...1
* austen -> austen...2
* sense -> sense...3
* and -> and...4
* sensibility -> sensibility...5
* ...
> strings_df <- dplyr::bind_rows(strings_df, strings_df, strings_df, strings_df, strings_df)
>
> summary(strings_df)
 jane...1           austen...2         sense...3          and...4
 Length:5000000     Length:5000000     Length:5000000     Length:5000000
 Class :character   Class :character   Class :character   Class :character
 Mode  :character   Mode  :character   Mode  :character   Mode  :character
 sensibility...5    jane...6           austen...7         sense...8
 Length:5000000     Length:5000000     Length:5000000     Length:5000000
 Class :character   Class :character   Class :character   Class :character
 Mode  :character   Mode  :character   Mode  :character   Mode  :character
 and...9            sensibility...10   jane...11          austen...12
 Length:5000000     Length:5000000     Length:5000000     Length:5000000
 Class :character   Class :character   Class :character   Class :character
 Mode  :character   Mode  :character   Mode  :character   Mode  :character
 sense...13         and...14           sensibility...15
 Length:5000000     Length:5000000     Length:5000000
 Class :character   Class :character   Class :character
 Mode  :character   Mode  :character   Mode  :character
>
> # when this runs, my cpu usage is always at 100% or slightly below (and note
> # that user <= elapsed below)
> system.time(tab <- Table$create(strings_df))
   user  system elapsed
 31.855   0.842  32.806
>
>
> # naively turn the strings into factors:
> strings_as_factors_df <- dplyr::mutate(strings_df, dplyr::across(.fns = as.factor))
>
> summary(strings_as_factors_df)
 jane...1          austen...2        sense...3         and...4
 Elino  :   1210   Elinor :   1140          :    915   Elino  :    625
 Elinor :   1120   Elino  :   1125   Elinor :    895   Elinor :    625
        :   1115          :    895   Elino  :    890          :    545
 Mariann:    890   Maria  :    745   Marian :    705   Marian :    480
 Elinor :    880   Elinor :    725   Elinor :    670   Elinor :    455
 Marian :    865   (Other):4491725   (Other):3746365   (Other):2497925
 (Other):4993920   NA's   : 503645   NA's   :1249560   NA's   :2499345
 sensibility...5   jane...6          austen...7        sense...8
 Elino  :     65   Elino  :   1210   Elinor :   1140          :    915
 Marian :     65   Elinor :   1120   Elino  :   1125   Elinor :    895
 Maria  :     60          :   1115          :    895   Elino  :    890
 Elinor :     55   Mariann:    890   Maria  :    745   Marian :    705
 Elinor :     55   Elinor :    880   Elinor :    725   Elinor :    670
 (Other): 249925   Marian :    865   (Other):4491725   (Other):3746365
 NA's   :4749775   (Other):4993920   NA's   : 503645   NA's   :1249560
 and...9           sensibility...10  jane...11         austen...12
 Elino  :    625   Elino  :     65   Elino  :   1210   Elinor :   1140
 Elinor :    625   Marian :     65   Elinor :   1120   Elino  :   1125
        :    545   Maria  :     60          :   1115          :    895
 Marian :    480   Elinor :     55   Mariann:    890   Maria  :    745
 Elinor :    455   Elinor :     55   Elinor :    880   Elinor :    725
 (Other):2497925   (Other): 249925   Marian :    865   (Other):4491725
 NA's   :2499345   NA's   :4749775   (Other):4993920   NA's   : 503645
 sense...13        and...14          sensibility...15
        :    915   Elino  :    625   Elino  :     65
 Elinor :    895   Elinor :    625   Marian :     65
 Elino  :    890          :    545   Maria  :     60
 Marian :    705   Marian :    480   Elinor :     55
 Elinor :    670   Elinor :    455   Elinor :     55
 (Other):3746365   (Other):2497925   (Other): 249925
 NA's   :1249560   NA's   :2499345   NA's   :4749775
>
>
> # when this runs, my cpu usage goes up to 400% (and user >> elapsed below)
> system.time(tab <- Table$create(strings_as_factors_df))
   user  system elapsed
 31.166   0.794   6.184
```
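To put a number on "user <= elapsed" versus "user >> elapsed", here is a rough sketch that divides CPU time by wall-clock time for the two runs. The timings are copied by hand from the output above; the data frame and its column names (`cpu_per_wall`, `speedup`) are just illustrative, and the ratio is only an approximation of how many cores were busy on average.

```r
# Timings (in seconds) copied from the two system.time() calls above.
timings <- data.frame(
  input   = c("character columns", "factor columns"),
  user    = c(31.855, 31.166),
  system  = c(0.842, 0.794),
  elapsed = c(32.806, 6.184)
)

# CPU seconds consumed per wall-clock second: about 1 means effectively
# single-threaded, well above 1 means several cores were busy at once.
timings$cpu_per_wall <- (timings$user + timings$system) / timings$elapsed

# Wall-clock speedup relative to the character-column run.
timings$speedup <- timings$elapsed[1] / timings$elapsed

print(timings, digits = 3)
```

The character run comes out at roughly 1 CPU second per wall-clock second, while the factor run comes out at roughly 5, with about the same total CPU time in both cases. So the work looks similar, it is just spread across cores in the factor case.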