paleolimbot opened a new pull request, #279:
URL: https://github.com/apache/arrow-nanoarrow/pull/279

   When collecting an array stream of unknown size into a data.frame, 
nanoarrow performs poorly. This is because it collects and converts every 
batch and then calls `c()` or `rbind()` on the results. This is 
particularly costly when collecting many tiny batches (e.g., those returned 
by many ADBC drivers).
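   
   As a rough base-R illustration of the two strategies (a sketch only, not 
nanoarrow's internals; the batch count and contents are made up for the 
example), combining already-converted batches at the end can be contrasted 
with allocating the output once and filling it batch by batch:
   
   ``` r
   # Hypothetical input: 100 small, already-converted batches
   batches <- replicate(100, data.frame(x = runif(10)), simplify = FALSE)
   
   # Strategy 1: convert everything, then combine at the end
   combined <- do.call(rbind, batches)
   
   # Strategy 2: compute the total size, allocate once, then fill
   total <- sum(vapply(batches, nrow, integer(1)))
   x <- double(total)
   offset <- 0L
   for (batch in batches) {
     n <- nrow(batch)
     x[offset + seq_len(n)] <- batch$x
     offset <- offset + n
   }
   filled <- data.frame(x = x)
   ```
   
   Both strategies produce the same values; the second avoids holding every 
converted batch plus the combined copy in memory at once.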
   
   `convert_array_stream()` has long had a "preallocate + fill" mode that is 
used when `size` is explicitly set. Recently, the addition of 
`basic_array_stream()` made it possible to recreate an array stream from a 
previously collected result. Together, these mean we can collect the whole 
stream, compute its total size, and then call `convert_array_stream()` on 
the recreated stream with a known size.
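   
   A minimal user-level sketch of that idea, assuming 
`collect_array_stream()` is available alongside `basic_array_stream()`; the 
variable names and the tiny input are illustrative, and the change in this 
PR may be structured differently internally:
   
   ``` r
   library(nanoarrow)
   
   # Illustrative input: a stream of two small batches
   stream <- basic_array_stream(list(data.frame(x = 1:3), data.frame(x = 4:5)))
   
   # Collect every batch from the stream without converting to R vectors
   schema <- stream$get_schema()
   batches <- collect_array_stream(stream, schema = schema, validate = FALSE)
   
   # Compute the total number of rows across the collected batches
   size <- sum(vapply(batches, function(batch) batch$length, numeric(1)))
   
   # Recreate a stream from the collected batches and convert with a known size
   stream2 <- basic_array_stream(batches, schema = schema, validate = FALSE)
   result <- convert_array_stream(stream2, size = size)
   ```
   
   Because `size` is known up front, the conversion can take the preallocate 
and fill path instead of combining per-batch results afterwards.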
   
   Before this PR:
   
   ``` r
   library(nanoarrow)
   
   data_frames <- replicate(
     1000,
     nanoarrow:::vec_gen(
       data.frame(x = logical(), y = double(), z = character()),
       n = 1000
     ),
     simplify = FALSE
   )
   
   bench::mark(
     convert_known_size = {
       stream <- basic_array_stream(data_frames, validate = FALSE)
       convert_array_stream(stream, size = 1000 * 1000)
     },
     convert_unknown_size = {
       stream <- basic_array_stream(data_frames, validate = FALSE)
       as.data.frame(stream)
     },
     convert_arrow_altrep = {
       options(arrow.use_altrep = TRUE)
       stream <- basic_array_stream(data_frames, validate = FALSE)
       reader <- arrow::as_record_batch_reader(stream)
       as.data.frame(as.data.frame(reader))
     },
     convert_arrow = {
       options(arrow.use_altrep = FALSE)
       stream <- basic_array_stream(data_frames, validate = FALSE)
       reader <- arrow::as_record_batch_reader(stream)
       as.data.frame(as.data.frame(reader))
     },
     min_iterations = 20
   )
   #> Warning: Some expressions had a GC in every iteration; so filtering is
   #> disabled.
   #> # A tibble: 4 × 6
   #>   expression                min   median `itr/sec` mem_alloc `gc/sec`
   #>   <bch:expr>           <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
   #> 1 convert_known_size    196.9ms    234ms      3.85    23.1MB     4.23
   #> 2 convert_unknown_size  375.8ms    479ms      2.12   429.3MB    13.2 
   #> 3 convert_arrow_altrep   67.4ms    164ms      4.78    20.4MB     6.70
   #> 4 convert_arrow         107.8ms    240ms      2.96    22.9MB     3.56
   ```
   
   After this PR:
   
   ``` r
   #> Warning: Some expressions had a GC in every iteration; so filtering is
   #> disabled.
   #> # A tibble: 4 × 6
   #>   expression                min   median `itr/sec` mem_alloc `gc/sec`
   #>   <bch:expr>           <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
   #> 1 convert_known_size    203.4ms    225ms     3.99     23.2MB     3.99
   #> 2 convert_unknown_size  266.5ms    396ms     0.895    23.1MB     2.60
   #> 3 convert_arrow_altrep   68.5ms    214ms     3.76     20.4MB     3.76
   #> 4 convert_arrow         130.6ms    227ms     2.93     22.9MB     3.23
   ```
   
   <sup>Created on 2023-08-17 with [reprex 
v2.0.2](https://reprex.tidyverse.org)</sup>
   

