paleolimbot opened a new issue, #39090: URL: https://github.com/apache/arrow/issues/39090
### Describe the enhancement requested

As identified by @ianmcook in https://github.com/apache/arrow/pull/39081#discussion_r1416239595

``` r
library(arrow, warn.conflicts = FALSE)
library(nanoarrow)

rows_per_batch <- 4096
n_batches <- 4096
df <- tibble::tibble(a = 1:rows_per_batch, b = a, c = a, d = a)
batches <- lapply(rep(list(df), n_batches), as_record_batch)

make_reader_batches <- function() {
  RecordBatchReader$create(batches = batches)
}

make_array_stream_batches <- function() {
  as_nanoarrow_array_stream(make_reader_batches())
}

bench::mark(
  # All C++, no R list involved
  make_reader_batches()$read_table(),
  # Unoptimized pure R implementation
  collect_array_stream(make_array_stream_batches()),
  # Theoretically optimized cpp11 code?
  make_reader_batches()$batches(),
  check = FALSE
)
#> Warning: Some expressions had a GC in every iteration; so filtering is
#> disabled.
#> # A tibble: 3 × 6
#>   expression                           min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>                      <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 make_reader_batches()$read_tab…   8.01ms   8.34ms    111.      1.09MB     7.94
#> 2 collect_array_stream(make_arra…  21.25ms  21.85ms     37.5   682.44KB     5.93
#> 3 make_reader_batches()$batches()  712.9ms  712.9ms      1.40    8.74MB    16.8
```

This might be unfixable (i.e., it might be that that's just how long it takes to create 4096 R6 objects); however, we should possibly investigate to make sure the path that I think is being taken is actually happening: https://github.com/apache/arrow/blob/main/r/src/arrow_cpp11.h#L312-L334 + https://github.com/apache/arrow/blob/main/r/src/recordbatchreader.cpp#L50-L54 .

### Component(s)

R
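One rough way to check the "that's just how long it takes to create 4096 R6 objects" hypothesis is to time the construction of the same number of trivial R6 objects in isolation. The sketch below is only an illustration: the `Wrapper` class is a hypothetical stand-in that holds a single field, not the class that arrow actually instantiates, and it assumes the R6 package is installed. If this runs far faster than the `$batches()` timing above, the overhead is probably elsewhere in the conversion path.

``` r
library(R6)

# Hypothetical stand-in for a lightweight R6 wrapper around one pointer-like
# field. This is NOT arrow's own class; it only approximates the shape.
Wrapper <- R6Class(
  "Wrapper",
  public = list(
    ptr = NULL,
    initialize = function(ptr) {
      self$ptr <- ptr
    }
  )
)

n_batches <- 4096

# Time the construction of n_batches R6 objects collected into an R list.
elapsed <- system.time(
  objs <- lapply(seq_len(n_batches), function(i) Wrapper$new(i))
)[["elapsed"]]

# Elapsed wall-clock seconds for the whole loop.
elapsed
```

Comparing `elapsed` against the roughly 0.7 s median reported for `make_reader_batches()$batches()` gives a first indication of whether R6 construction alone could plausibly explain the gap.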
