paleolimbot commented on issue #822:
URL: 
https://github.com/apache/arrow-nanoarrow/issues/822#issuecomment-3555506255

   I'm still working on this, but I thought I'd share my thinking so far: I 
believe the issue is with preserve/protect. I believe `basic_array_stream()` 
is one of the code paths that ends up calling it a *lot* (as does 
`nanoarrow_array_modify()`, which I believe is the culprit in the reproducer 
below).
   
   There are two possible solutions: we can rewrite the offending function 
(I believe it's `array_export()` in array.h) so that it doesn't use 
preserve/protect, or we can make preserve/protect faster by using cpp11. I'll 
investigate both!
   
   ``` r
   library(nanoarrow)
   
   ascii_bytes <- vapply(letters, charToRaw, raw(1), USE.NAMES = FALSE)
   
   random_string_array <- function(n = 1, n_chars = 16) {
     data_buffer <- sample(ascii_bytes, n_chars * n, replace = TRUE)
     offsets_buffer <- as.integer(seq(0, n * n_chars, length.out = n + 1))
     nanoarrow_array_modify(
       nanoarrow_array_init(na_string()),
       list(
         length = n,
         null_count = 0,
         buffers = list(NULL, offsets_buffer, data_buffer)
       )
     )
   }
   
   random_string_struct <- function(n_rows = 1024, n_cols = 1, n_chars = 16) {
     col_names <- sprintf("col%03d", seq_len(n_cols))
     col_types <- rep(list(na_string()), n_cols)
     names(col_types) <- col_names
     schema <- na_struct(col_types)
     
     columns <- lapply(
       col_names,
       function(...) random_string_array(n_rows, n_chars = n_chars)
     )
     
     nanoarrow_array_modify(
       nanoarrow_array_init(schema),
       list(
         length = n_rows,
         null_count = 0,
         children = columns
       )
     )  
   }
   
   random_string_batches <- function(n_batches = 1, n_rows = 1, n_cols = 1,
                                     n_chars = 16) {
     lapply(
       seq_len(n_batches),
       function(...) random_string_struct(n_rows, n_cols, n_chars)
     )
   }
   
   batches <- random_string_batches(n_batches = 100, n_cols = 160)
   nanoarrow:::preserved_count()
   #> [1] 128320
   system.time(gc(), gcFirst = FALSE)
   #>    user  system elapsed 
   #>   0.109   0.000   0.109
   batches <- NULL
   system.time(gc(), gcFirst = FALSE)
   #>    user  system elapsed 
   #>  18.823   0.052  18.890
   ```
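
   As a rough illustration of why a large preserved count could make the 
final `gc()` slow, here is a toy model; it is not nanoarrow code, and 
`release_steps()` is a hypothetical helper. It assumes the preserve list 
behaves like a singly linked list with new entries pushed at the head, and 
that each release walks from the head until it finds its entry. Whether that 
is the actual bottleneck here is still unconfirmed.

```r
# Toy model (hypothetical, not nanoarrow code): count the list nodes visited
# when `n` preserved objects are released from a head-insert linked list.
release_steps <- function(n, order = c("oldest_first", "newest_first")) {
  order <- match.arg(order)
  # Entry i was inserted i-th, so the list reads n, n - 1, ..., 1 from the head
  remaining <- rev(seq_len(n))
  to_release <- if (order == "oldest_first") seq_len(n) else rev(seq_len(n))

  steps <- 0
  for (id in to_release) {
    # Number of nodes walked from the head before the target is found
    pos <- match(id, remaining)
    steps <- steps + pos
    remaining <- remaining[-pos]
  }
  steps
}

# Worst case: every release walks the whole remaining list, n * (n + 1) / 2
release_steps(1000, "oldest_first")
# Best case: the target is always at the head, n steps in total
release_steps(1000, "newest_first")
```

   Under the worst-case assumption, the 128320 preserved objects reported 
above would need about 128320 * 128321 / 2, or roughly 8.2 billion, node 
visits, which is at least the right order of magnitude to explain a pause of 
many seconds. The real release order during garbage collection is not 
something this model captures.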
   
   <sup>Created on 2025-11-19 with [reprex v2.1.1](https://reprex.tidyverse.org)</sup>

