[I] [R] `RecordBatchReader$batches()` is very slow [arrow]

via GitHub Tue, 05 Dec 2023 17:42:15 -0800


paleolimbot opened a new issue, #39090:
URL: https://github.com/apache/arrow/issues/39090


   ### Describe the enhancement requested
   
   As identified by @ianmcook in 
https://github.com/apache/arrow/pull/39081#discussion_r1416239595
   
   ``` r
   library(arrow, warn.conflicts = FALSE)
   library(nanoarrow)
   
   rows_per_batch <- 4096
   n_batches <- 4096
   df <- tibble::tibble(a = 1:rows_per_batch, b = a, c = a, d = a)
   batches <- lapply(rep(list(df), n_batches), as_record_batch)
   
   make_reader_batches <- function() {
     RecordBatchReader$create(batches = batches)
   }
   
   make_array_stream_batches <- function() {
     as_nanoarrow_array_stream(make_reader_batches())
   }
   
   bench::mark(
     # All C++, no R list involved
     make_reader_batches()$read_table(),
     # Unoptimized pure R implementation
     collect_array_stream(make_array_stream_batches()),
     # Theoretically optimized cpp11 code?
     make_reader_batches()$batches(),
     check = FALSE
   )
   #> Warning: Some expressions had a GC in every iteration; so filtering is
   #> disabled.
   #> # A tibble: 3 × 6
   #>   expression                           min   median `itr/sec` mem_alloc `gc/sec`
   #>   <bch:expr>                      <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
   #> 1 make_reader_batches()$read_tab…   8.01ms   8.34ms    111.      1.09MB     7.94
   #> 2 collect_array_stream(make_arra…  21.25ms  21.85ms     37.5   682.44KB     5.93
   #> 3 make_reader_batches()$batches()  712.9ms  712.9ms      1.40    8.74MB    16.8
   ```
   
   This might be unfixable: it may simply be how long it takes to create 4096 R6 objects (about 713 ms here, versus about 8 ms for `$read_table()`). Still, we should check that the code path I expect ( https://github.com/apache/arrow/blob/main/r/src/arrow_cpp11.h#L312-L334 plus https://github.com/apache/arrow/blob/main/r/src/recordbatchreader.cpp#L50-L54 ) is actually the one being taken.
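   
   One rough way to test the "it's just R6" hypothesis is to time the creation of 4096 trivial R6 objects with no Arrow code involved. This is only a sketch: `Wrapper` is a hypothetical stand-in class, not arrow's actual R6 hierarchy, and it is much lighter than the real classes, so treat the result as a lower bound rather than a direct comparison.
   
   ```r
   library(R6)

   # Hypothetical stand-in for an arrow R6 wrapper: it holds a single field,
   # loosely analogous to a wrapper that stores a pointer to a C++ object.
   Wrapper <- R6Class(
     "Wrapper",
     public = list(
       ptr = NULL,
       initialize = function(ptr) {
         self$ptr <- ptr
       }
     )
   )

   n_batches <- 4096

   # Time only the creation of n_batches R6 objects in an R-level loop.
   elapsed <- system.time(
     objs <- lapply(seq_len(n_batches), function(i) Wrapper$new(i))
   )[["elapsed"]]

   # Elapsed seconds for object creation alone.
   elapsed
   ```
   
   If this takes far less time than `$batches()` does in the benchmark above, R6 construction alone probably does not explain the gap and the C++ to R conversion path deserves a closer look.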
   
   ### Component(s)
   
   R

