klin333 commented on issue #219:
URL: 
https://github.com/apache/arrow-nanoarrow/issues/219#issuecomment-1922866227

   Hi,
   
   Not sure if this is related, but the debug information below may be of
   interest. As a side note, I'm not sure where rbind is used, but
   dplyr::bind_rows should be strictly superior in terms of computation time.
   
   What I am finding is that bind_rows is normally extremely fast, but when
   it is used in convert_array_stream with multiple batches (R debug browser
   code not shown), it slows down dramatically (by a factor of 10). When I
   make a deep copy first, bind_rows is fast again.
   
   I am interested in bind_rows because I was trying to speed up
   convert_array_stream with multiple batches by first converting each batch
   to a data.frame (super fast across all batches) and then calling bind_rows
   in R. That should also have been very fast, but it turned out to be slow
   without a deep copy.
   
   Any idea what's causing the slow bind_rows, and whether this strategy is a
   viable way of speeding up convert_array_stream with multiple batches?
   Currently convert_array_stream is atrociously slow with a high number of
   batches.
   
   ```r
   
   library(nanoarrow)
   library(tictoc)
   
   # generate some data
   gen_df <- function(n) {
     data.frame(w = uuid::UUIDgenerate(n = n))
   }
   N <- 50000
   df_list <- lapply(seq(300), function(i) gen_df(N))
   
   
   # bind_rows on a list of data frames is very fast, median time 200 ms.
   # rbind is hopeless: 4 seconds.
   microbenchmark::microbenchmark(times = 20, dplyr::bind_rows(df_list))
   binded_df <- dplyr::bind_rows(df_list)
   tic(); do.call(rbind, df_list); toc()
   
   # single stream (instant)
   single_stream <- as_nanoarrow_array_stream(binded_df)
   tic()
   single_converted <- convert_array_stream(single_stream)
   toc()
   single_stream$release()
   
   # batched streams (3 seconds)
   batches <- lapply(df_list, as_nanoarrow_array)
   batched_stream <- basic_array_stream(batches)
   tic()
   batch_converted <- convert_array_stream(batched_stream)
   toc()
   batched_stream$release()
   
   
   # diy convert (batch conversion is super fast, but the bind_rows now takes
   # 3 seconds)
   # code taken from convert_array_stream
   tic()
   to <- dplyr::slice(gen_df(1), 0)
   tmp <- lapply(batches, function(b) {
     .Call(nanoarrow:::nanoarrow_c_convert_array, b, to)
   })
   toc()
   for(b in batches) {nanoarrow_pointer_release(b)}
   tic()
   diy_converted <- dplyr::bind_rows(tmp)
   toc()
   
   # create a deep copy of the data frame list, where the vectors inside each
   # data frame are deep copies
   # then bind_rows becomes fast again (median 200 ms)
   tmp_deepcopy <- tmp
   for (i in seq_along(tmp_deepcopy)) {
     tmp_deepcopy[[i]]$w <- c(tmp_deepcopy[[i]]$w)
   }
   stopifnot(
     lobstr::obj_addr(tmp[[1]]$w) != lobstr::obj_addr(tmp_deepcopy[[1]]$w)
   )
   microbenchmark::microbenchmark(times = 10, dplyr::bind_rows(tmp_deepcopy))
   
   ```
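   
   The strategy described above (convert each batch, copy, then bind) could be
   wrapped up roughly as follows. This is only a sketch using the public
   nanoarrow functions rather than the internal `.Call()`;
   `convert_stream_by_batch()` is a hypothetical helper name, not part of
   nanoarrow, and the assumption that the `c()` copy is what restores
   bind_rows speed comes only from the timings above.
   
   ```r
   library(nanoarrow)
   
   # Hypothetical helper (not part of nanoarrow): pull batches one at a time,
   # convert each to a data.frame, copy its columns, then bind once at the end.
   convert_stream_by_batch <- function(stream) {
     pieces <- list()
     # get_next() returns NULL once the stream is exhausted
     while (!is.null(batch <- stream$get_next())) {
       df <- convert_array(batch)
       # Same copy step as in the benchmark: c() returns a fresh vector.
       # Assumes flat atomic columns; nested columns would need other handling.
       df[] <- lapply(df, c)
       pieces[[length(pieces) + 1L]] <- df
     }
     stream$release()
     dplyr::bind_rows(pieces)
   }
   
   # Usage with a small two-batch stream
   batches <- list(
     as_nanoarrow_array(data.frame(w = c("a", "b"))),
     as_nanoarrow_array(data.frame(w = c("c", "d")))
   )
   result <- convert_stream_by_batch(basic_array_stream(batches))
   ```
   
   Whether the extra copy is cheaper overall than the slow bind_rows would
   still need to be measured on the 300-batch data above.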
   
   ```
   sessionInfo()
   
   R version 4.3.2 (2023-10-31 ucrt)
   Platform: x86_64-w64-mingw32/x64 (64-bit)
   Running under: Windows 11 x64 (build 22621)
   
   Matrix products: default
   
   
   locale:
   [1] LC_COLLATE=English_Australia.utf8  LC_CTYPE=English_Australia.utf8    LC_MONETARY=English_Australia.utf8 LC_NUMERIC=C
   [5] LC_TIME=English_Australia.utf8
   
   time zone: Australia/Sydney
   tzcode source: internal
   
   attached base packages:
   [1] stats     graphics  grDevices datasets  utils     methods   base     
   
   other attached packages:
   [1] tictoc_1.2      nanoarrow_0.4.0
   
   loaded via a namespace (and not attached):
    [1] utf8_1.2.4            R6_2.5.1              microbenchmark_1.4.10 tidyselect_1.2.0      magrittr_2.0.3        glue_1.6.2
    [7] tibble_3.2.1          pkgconfig_2.0.3       dplyr_1.1.4           generics_0.1.3        lifecycle_1.0.4       cli_3.6.1
   [13] fansi_1.0.5           vctrs_0.6.4           renv_1.0.3            compiler_4.3.2        rstudioapi_0.15.0     tools_4.3.2
   [19] pillar_1.9.0          lobstr_1.1.2          rlang_1.1.2           uuid_1.1-1
   ```
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
