jonkeane commented on pull request #9615:
URL: https://github.com/apache/arrow/pull/9615#issuecomment-841288052
Ok, I dug into this a little bit more and I think a lot of the improvement
on strings is happening because of non-parallelization optimizations (not that
that's a bad thing!) though there is definitely some boost from parallelization.
The default is to use all available threads:
```
> system.time(tab <- Table$create(fannie_df))
   user  system elapsed
  7.028   0.548   2.739
```
Setting the thread pool capacity to one roughly doubles the elapsed time (and
user time is then approximately equal to elapsed time):
```
> arrow:::SetCpuThreadPoolCapacity(1L)
> system.time(tab <- Table$create(fannie_df))
   user  system elapsed
  5.988   0.428   5.369
```
Setting it back to the default for my machine (12) gives performance similar
to the first run:
```
> arrow:::SetCpuThreadPoolCapacity(12L)
> system.time(tab <- Table$create(fannie_df))
   user  system elapsed
  7.060   0.638   2.801
```
And finally, turning off threads entirely gives performance similar to
setting the thread pool to 1:
```
> options("arrow.use_threads" = FALSE)
> arrow:::option_use_threads()
[1] FALSE
> system.time(tab <- Table$create(fannie_df))
   user  system elapsed
  5.907   0.360   6.281
```
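For anyone who wants to repeat this comparison, here is a minimal sketch of a helper that times one call under several thread pool capacities. `time_by_capacity` is a hypothetical name, not part of arrow. By default it uses the same internal `arrow:::SetCpuThreadPoolCapacity()` call as in the transcripts above, and it does not touch the separate `arrow.use_threads` option or restore the original capacity afterwards.

```r
# Hypothetical helper (not part of arrow): time run() once for each
# requested CPU thread pool capacity and collect the timings.
time_by_capacity <- function(run, capacities,
                             set_capacity = arrow:::SetCpuThreadPoolCapacity) {
  rows <- lapply(capacities, function(n) {
    # Change the pool size before timing, as in the manual runs above.
    set_capacity(as.integer(n))
    timing <- system.time(run())
    data.frame(
      capacity = n,
      user = timing[["user.self"]],   # CPU time in this process only
      system = timing[["sys.self"]],
      elapsed = timing[["elapsed"]]   # wall-clock time
    )
  })
  do.call(rbind, rows)
}
```

With arrow attached and `fannie_df` loaded, something like `time_by_capacity(function() Table$create(fannie_df), c(1L, 12L))` should give one row per setting. A single run per setting is noisy, so repeating each measurement a few times would be more trustworthy.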
I also installed the 4.0.0.1 release on this same system and re-ran the same
call. This performance is in line with what we're seeing in the benchmarks:
```
> system.time(tab <- Table$create(fannie_df))
   user  system elapsed
 44.129   0.647  44.889
```
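To put rough numbers on the split between the two effects, here is a small sketch that divides the elapsed times reported above. It assumes the 4.0.0.1 run is effectively single-threaded for this conversion (its user time is close to its elapsed time), and it uses a single measurement per setting, so treat the ratios as approximate.

```r
# Elapsed seconds copied from the runs above (one measurement each).
elapsed <- c(
  released       = 44.889,  # 4.0.0.1
  pr_one_thread  = 5.369,   # this PR, thread pool capacity 1
  pr_all_threads = 2.739    # this PR, default capacity
)

gains <- round(c(
  # Improvement that remains with a single thread
  single_thread = elapsed[["released"]] / elapsed[["pr_one_thread"]],
  # Additional improvement from using all threads
  threading     = elapsed[["pr_one_thread"]] / elapsed[["pr_all_threads"]],
  # Overall improvement
  total         = elapsed[["released"]] / elapsed[["pr_all_threads"]]
), 2)

print(gains)
```

On these figures, the single-threaded gain is about 8.4x and threading adds a further factor of about 2, for roughly 16x overall.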
I also tried an enlarged dataset (four copies of the original) to observe the
effect of CPU thread pool capacity:
```
> mega_fannie_df <- dplyr::bind_rows(fannie_df, fannie_df, fannie_df, fannie_df)
> system.time(tab <- Table$create(mega_fannie_df))
   user  system elapsed
 34.050   3.510  11.907
> arrow:::SetCpuThreadPoolCapacity(1L)
> system.time(tab <- Table$create(mega_fannie_df))
   user  system elapsed
 25.747   1.934  23.584
```
And finally turning off threads entirely:
```
> system.time(tab <- Table$create(mega_fannie_df))
   user  system elapsed
 25.714   1.948  27.840
```
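As a quick check on how the timings scale with input size, this sketch compares elapsed times for the original and the 4x dataset under each setting. The numbers are copied from the transcripts above. The last row assumes the final run was made with `options("arrow.use_threads" = FALSE)`, as the prose says, even though that call is not shown in the transcript.

```r
# Elapsed seconds from the runs above: original data vs. 4x stacked data.
scaling <- data.frame(
  setting = c("all threads", "capacity 1", "use_threads = FALSE"),
  base    = c(2.739, 5.369, 6.281),
  mega    = c(11.907, 23.584, 27.840)
)

# Ratio of elapsed times; 4 would be perfectly linear in the number of rows.
scaling$ratio <- round(scaling$mega / scaling$base, 2)

print(scaling)
```

The ratios come out at about 4.35 to 4.43, slightly worse than linear, and the roughly 2x advantage of the full thread pool over a single thread is about the same on the larger input (23.584 / 11.907 is about 1.98).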