paleolimbot commented on issue #822:
URL:
https://github.com/apache/arrow-nanoarrow/issues/822#issuecomment-3555506255
I'm still working on this, but I thought I'd share my thinking so far: I
believe the problem is in the preserve/protect mechanism. I believe
`basic_array_stream()` is one of the paths through which it ends up getting
called a *lot* (another is `nanoarrow_array_modify()`, which I believe is the
culprit in the reproducer below).
There are two possible solutions: we can rewrite the offending function (I
believe it is `array_export()` in array.h) so that it does not use
preserve/protect, or we can make preserve/protect faster by using cpp11. I'll
investigate both!
``` r
library(nanoarrow)

ascii_bytes <- vapply(letters, charToRaw, raw(1), USE.NAMES = FALSE)

random_string_array <- function(n = 1, n_chars = 16) {
  data_buffer <- sample(ascii_bytes, n_chars * n, replace = TRUE)
  offsets_buffer <- as.integer(seq(0, n * n_chars, length.out = n + 1))

  nanoarrow_array_modify(
    nanoarrow_array_init(na_string()),
    list(
      length = n,
      null_count = 0,
      buffers = list(NULL, offsets_buffer, data_buffer)
    )
  )
}

random_string_struct <- function(n_rows = 1024, n_cols = 1, n_chars = 16) {
  col_names <- sprintf("col%03d", seq_len(n_cols))
  col_types <- rep(list(na_string()), n_cols)
  names(col_types) <- col_names
  schema <- na_struct(col_types)

  columns <- lapply(
    col_names,
    function(...) random_string_array(n_rows, n_chars = n_chars)
  )

  nanoarrow_array_modify(
    nanoarrow_array_init(schema),
    list(
      length = n_rows,
      null_count = 0,
      children = columns
    )
  )
}

random_string_batches <- function(n_batches = 1, n_rows = 1, n_cols = 1,
                                  n_chars = 16) {
  lapply(
    seq_len(n_batches),
    function(...) random_string_struct(n_rows, n_cols, n_chars)
  )
}

batches <- random_string_batches(n_batches = 100, n_cols = 160)
nanoarrow:::preserved_count()
#> [1] 128320
system.time(gc(), gcFirst = FALSE)
#>    user  system elapsed
#>   0.109   0.000   0.109
batches <- NULL
system.time(gc(), gcFirst = FALSE)
#>    user  system elapsed
#>  18.823   0.052  18.890
```
<sup>Created on 2025-11-19 with [reprex v2.1.1](https://reprex.tidyverse.org)</sup>
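For context on why preserve/protect could be the bottleneck: R's `R_PreserveObject()` pushes each object onto a singly linked list and `R_ReleaseObject()` scans that list to find the entry to remove, whereas cpp11 keeps its protected objects in a doubly linked list so that a release does not need a scan. The sketch below is only a back-of-the-envelope cost model, not nanoarrow code: `release_visits()` is a hypothetical helper, and the order in which objects are released during `gc()` is an assumption (the worst case, oldest first, is compared with the best case, newest first).

``` r
# Hypothetical cost model (not part of nanoarrow or R): count how many list
# nodes are visited when releasing n preserved objects one at a time.
release_visits <- function(n, order = c("oldest_first", "newest_first")) {
  order <- match.arg(order)
  n <- as.numeric(n)  # use doubles to avoid integer overflow for large n

  switch(
    order,
    # Singly linked list with the newest entry at the head: finding the oldest
    # entry walks every remaining node, so the total is n + (n - 1) + ... + 1.
    oldest_first = n * (n + 1) / 2,
    # The newest entry is always at the head, so each release visits one node.
    newest_first = n
  )
}

# Count reported by nanoarrow:::preserved_count() in the reprex above
n_preserved <- 128320

# Ratio of worst-case to best-case node visits under this model
release_visits(n_preserved, "oldest_first") /
  release_visits(n_preserved, "newest_first")
```

Under these assumptions the worst case grows quadratically with the number of preserved objects, while constant-time removal grows linearly. That would be consistent with the slow `gc()` after `batches <- NULL`, although the actual release order has not been verified here.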