klin333 commented on issue #219:
URL:
https://github.com/apache/arrow-nanoarrow/issues/219#issuecomment-1922866227
Hi,
Not sure if this is related, but just in case the debug information below is
of interest. As a side note, I'm not sure where rbind is used, but
dplyr::bind_rows should be strictly superior in terms of computation time.
What I am finding is that bind_rows is normally extremely fast, but when
used in convert_array_stream with multiple batches (R debug browser code not
shown), it slows down dramatically (by a factor of 10). When I make a deep
copy first, bind_rows is fast again.
The reason I'm interested in bind_rows is that I was trying to speed up
convert_array_stream with multiple batches by first converting each batch to
a data.frame (super fast across all batches) and then calling bind_rows in R,
which again should have been very fast, but turned out slow without a deep copy.
Any idea what's causing the slow bind_rows, and whether this strategy is a
viable way of speeding up convert_array_stream with multiple batches? Currently
convert_array_stream is atrociously slow with a high number of batches.
```r
library(nanoarrow)
library(tictoc)
# generate some data
gen_df <- function(n) {
  data.frame(w = uuid::UUIDgenerate(n = n))
}
N <- 50000
df_list <- lapply(seq(300), function(i) gen_df(N))
# bind_rows on a list of data frames is very fast (median time 200 ms);
# rbind is hopeless (4 seconds).
microbenchmark::microbenchmark(times = 20, dplyr::bind_rows(df_list))
binded_df <- dplyr::bind_rows(df_list)
tic(); do.call(rbind, df_list); toc()
# single stream (instant)
single_stream <- as_nanoarrow_array_stream(binded_df)
tic()
single_converted <- convert_array_stream(single_stream)
toc()
single_stream$release()
# batched streams (3 seconds)
batches <- lapply(df_list, as_nanoarrow_array)
batched_stream <- basic_array_stream(batches)
tic()
batch_converted <- convert_array_stream(batched_stream)
toc()
batched_stream$release()
# diy convert (batch conversion is super fast, but the bind_rows now takes
# 3 seconds)
# code taken from convert_array_stream
tic()
to <- dplyr::slice(gen_df(1), 0)
tmp <- lapply(batches, function(b) {
  .Call(nanoarrow:::nanoarrow_c_convert_array, b, to)
})
toc()
for(b in batches) {nanoarrow_pointer_release(b)}
tic()
diy_converted <- dplyr::bind_rows(tmp)
toc()
# create a deep copy of the data frame list, where the vectors inside each
# data frame are deep copies;
# then bind_rows becomes fast again (median 200 ms)
tmp_deepcopy <- tmp
for (i in seq_along(tmp_deepcopy)) {
  tmp_deepcopy[[i]]$w <- c(tmp_deepcopy[[i]]$w)
}
stopifnot(lobstr::obj_addr(tmp[[1]]$w) !=
            lobstr::obj_addr(tmp_deepcopy[[1]]$w))
microbenchmark::microbenchmark(times = 10, dplyr::bind_rows(tmp_deepcopy))
```
```
sessionInfo()
R version 4.3.2 (2023-10-31 ucrt)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 11 x64 (build 22621)
Matrix products: default
locale:
[1] LC_COLLATE=English_Australia.utf8 LC_CTYPE=English_Australia.utf8
LC_MONETARY=English_Australia.utf8 LC_NUMERIC=C
[5] LC_TIME=English_Australia.utf8
time zone: Australia/Sydney
tzcode source: internal
attached base packages:
[1] stats graphics grDevices datasets utils methods base
other attached packages:
[1] tictoc_1.2 nanoarrow_0.4.0
loaded via a namespace (and not attached):
[1] utf8_1.2.4 R6_2.5.1 microbenchmark_1.4.10
tidyselect_1.2.0 magrittr_2.0.3 glue_1.6.2
[7] tibble_3.2.1 pkgconfig_2.0.3 dplyr_1.1.4
generics_0.1.3 lifecycle_1.0.4 cli_3.6.1
[13] fansi_1.0.5 vctrs_0.6.4 renv_1.0.3
compiler_4.3.2 rstudioapi_0.15.0 tools_4.3.2
[19] pillar_1.9.0 lobstr_1.1.2 rlang_1.1.2
uuid_1.1-1
```
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.