jonkeane commented on pull request #9615:
URL: https://github.com/apache/arrow/pull/9615#issuecomment-841288052


   Ok, I dug into this a little bit more and I think a lot of the improvement 
on strings is happening because of non-parallelization optimizations (not that 
that's a bad thing!) though there is definitely some boost from parallelization.
   
   The default is to use all available threads:
   ```
   > system.time(tab <- Table$create(fannie_df))
      user  system elapsed 
     7.028   0.548   2.739 
   ```
   
   Setting the thread pool capacity to one roughly doubles the elapsed time 
(and user time ~= elapsed time):
   ```
   > arrow:::SetCpuThreadPoolCapacity(1L)
   > system.time(tab <- Table$create(fannie_df))
      user  system elapsed 
     5.988   0.428   5.369 
   ```
   
   Setting the capacity back to the default for my machine (12) gives 
performance similar to the first run:
   ```
   > arrow:::SetCpuThreadPoolCapacity(12L)
   > system.time(tab <- Table$create(fannie_df))
      user  system elapsed 
     7.060   0.638   2.801 
   ```
   
   Finally, turning off threads entirely gives performance similar to setting 
the thread pool capacity to 1:
   ```
   > options("arrow.use_threads" = FALSE)
   > arrow:::option_use_threads()
   [1] FALSE
   > system.time(tab <- Table$create(fannie_df))
     user  system elapsed 
    5.907   0.360   6.281 
   ```
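
   The runs above could be scripted as a small sweep over thread pool 
capacities. This is only a sketch: `time_table_create` is a hypothetical helper 
name, and it assumes the installed arrow version exports `cpu_count()` and 
`set_cpu_count()` as public wrappers around the thread pool capacity.

   ```r
   library(arrow)

   # Hypothetical helper: time Table$create() once per thread pool capacity.
   # Assumes arrow exports cpu_count() and set_cpu_count().
   time_table_create <- function(df, capacities) {
     original <- cpu_count()
     # Restore the original capacity even if a run fails.
     on.exit(set_cpu_count(original), add = TRUE)
     elapsed <- vapply(capacities, function(n) {
       set_cpu_count(n)
       system.time(Table$create(df))[["elapsed"]]
     }, numeric(1))
     # Label each timing with the capacity it was measured under.
     setNames(elapsed, capacities)
   }
   ```

   For example, `time_table_create(fannie_df, c(1L, 12L))` would return the 
two elapsed times as a named vector. A single run per setting is noisy, so 
repeating each measurement would give a more reliable comparison.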
   
   I also installed 4.0.0.1 on this same system and re-ran. This performance is 
in line with what we're seeing in the benchmarks.
   ```
   > system.time(tab <- Table$create(fannie_df))
      user  system elapsed 
    44.129   0.647  44.889 
   ``` 
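
   As a rough back-of-the-envelope split of the gain, using only the elapsed 
times reported above (the variable names are just for illustration):

   ```r
   # Elapsed seconds copied from the runs above.
   elapsed_4001        <- 44.889  # 4.0.0.1
   elapsed_one_thread  <- 5.369   # this PR, thread pool capacity 1
   elapsed_all_threads <- 2.739   # this PR, default thread pool

   # Speedup with parallelism taken out of the picture.
   single_thread_speedup <- elapsed_4001 / elapsed_one_thread
   # Additional speedup from using all threads.
   parallel_speedup <- elapsed_one_thread / elapsed_all_threads
   # Overall speedup relative to 4.0.0.1.
   total_speedup <- elapsed_4001 / elapsed_all_threads

   print(round(c(single_thread = single_thread_speedup,
                 parallel      = parallel_speedup,
                 total         = total_speedup), 2))
   ```

   On these numbers the single-threaded path is already roughly 8x faster than 
4.0.0.1, and threading adds roughly another 2x on top, which is why most of the 
improvement looks like it comes from the non-parallel changes. These are 
single runs, so the ratios are only approximate.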
   
   
   
   I also tried an enlarged dataset (four copies of `fannie_df`) to see the 
effect of thread pool capacity:
   ```
   > mega_fannie_df <- dplyr::bind_rows(fannie_df, fannie_df, fannie_df, fannie_df)
   > system.time(tab <- Table$create(mega_fannie_df))
      user  system elapsed 
    34.050   3.510  11.907 
   > arrow:::SetCpuThreadPoolCapacity(1L)
   > system.time(tab <- Table$create(mega_fannie_df))
      user  system elapsed 
    25.747   1.934  23.584 
   ```
   
   And finally turning off threads entirely:
   ```
   > system.time(tab <- Table$create(mega_fannie_df))
      user  system elapsed 
    25.714   1.948  27.840 
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]

