paleolimbot opened a new pull request, #279:
URL: https://github.com/apache/arrow-nanoarrow/pull/279
When collecting an array stream of unknown size into a data.frame,
nanoarrow performs poorly. This is because it collects and converts every
batch separately and then combines the results with `c()` or `rbind()`.
The cost is particularly high when collecting many tiny batches (e.g., those
returned by many ADBC drivers).

`convert_array_stream()` has long had a "preallocate + fill" mode that is
used when `size` is explicitly set. The recent addition of
`basic_array_stream()` makes it possible to recreate an array stream from a
previously collected result. Together, these mean we can collect the whole
stream, compute the total size, and then call `convert_array_stream()` with a
known size.
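
The collect-then-convert idea described above can be sketched roughly as follows. This is a minimal illustration and not the code in this PR: the helper name `convert_stream_two_pass()` is hypothetical, and the sketch assumes that `stream$get_next()` returns `NULL` at the end of the stream and that each collected array exposes its row count as `$length`.

``` r
library(nanoarrow)

# Hypothetical helper (not the PR's implementation): collect every batch,
# compute the total size, then convert with a known size.
convert_stream_two_pass <- function(stream) {
  schema <- stream$get_schema()

  # Pass 1: pull each batch as an Arrow array without converting it to R.
  batches <- list()
  while (!is.null(batch <- stream$get_next())) {
    batches[[length(batches) + 1L]] <- batch
  }
  stream$release()

  # With all batches in hand, the total number of rows is known.
  size <- sum(vapply(batches, function(b) as.double(b$length), double(1)))

  # Pass 2: replay the collected batches and let convert_array_stream()
  # use its preallocate + fill path.
  replay <- basic_array_stream(batches, schema = schema, validate = FALSE)
  convert_array_stream(replay, size = size)
}

# Usage with two small batches
chunks <- list(
  data.frame(x = 1:3, y = c("a", "b", "c")),
  data.frame(x = 4:5, y = c("d", "e"))
)
result <- convert_stream_two_pass(basic_array_stream(chunks))
```

The trade-off is that every batch stays in memory in Arrow form until the final conversion, in exchange for a single allocation of the R output.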
Before this PR:
``` r
library(nanoarrow)
data_frames <- replicate(
  1000,
  nanoarrow:::vec_gen(
    data.frame(x = logical(), y = double(), z = character()),
    n = 1000
  ),
  simplify = FALSE
)
bench::mark(
  convert_known_size = {
    stream <- basic_array_stream(data_frames, validate = FALSE)
    convert_array_stream(stream, size = 1000 * 1000)
  },
  convert_unknown_size = {
    stream <- basic_array_stream(data_frames, validate = FALSE)
    as.data.frame(stream)
  },
  convert_arrow_altrep = {
    options(arrow.use_altrep = TRUE)
    stream <- basic_array_stream(data_frames, validate = FALSE)
    reader <- arrow::as_record_batch_reader(stream)
    as.data.frame(as.data.frame(reader))
  },
  convert_arrow = {
    options(arrow.use_altrep = FALSE)
    stream <- basic_array_stream(data_frames, validate = FALSE)
    reader <- arrow::as_record_batch_reader(stream)
    as.data.frame(as.data.frame(reader))
  },
  min_iterations = 20
)
#> Warning: Some expressions had a GC in every iteration; so filtering is
#> disabled.
#> # A tibble: 4 × 6
#> expression min median `itr/sec` mem_alloc `gc/sec`
#> <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl>
#> 1 convert_known_size 196.9ms 234ms 3.85 23.1MB 4.23
#> 2 convert_unknown_size 375.8ms 479ms 2.12 429.3MB 13.2
#> 3 convert_arrow_altrep 67.4ms 164ms 4.78 20.4MB 6.70
#> 4 convert_arrow 107.8ms 240ms 2.96 22.9MB 3.56
```
After this PR:
``` r
#> Warning: Some expressions had a GC in every iteration; so filtering is
#> disabled.
#> # A tibble: 4 × 6
#> expression min median `itr/sec` mem_alloc `gc/sec`
#> <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl>
#> 1 convert_known_size 203.4ms 225ms 3.99 23.2MB 3.99
#> 2 convert_unknown_size 266.5ms 396ms 0.895 23.1MB 2.60
#> 3 convert_arrow_altrep 68.5ms 214ms 3.76 20.4MB 3.76
#> 4 convert_arrow 130.6ms 227ms 2.93 22.9MB 3.23
```
<sup>Created on 2023-08-17 with [reprex
v2.0.2](https://reprex.tidyverse.org)</sup>