paleolimbot commented on issue #219:
URL: https://github.com/apache/arrow-nanoarrow/issues/219#issuecomment-1924035325

   I'm not sure I can solve your problem; however, I'll explain what I *think* 
is happening and perhaps that will help.
   
   ```
   # bind_rows on list of tibbles is very fast, median time 200ms. rbind is 
hopeless 4 seconds.
   ```
   
   The reason that `bind_rows()` is faster than `convert_array_stream()` here is that the strings are already R strings. R has a global string pool and all of the strings are already R objects, so the "bind" operation is "just" copying pointers to existing strings. The "slow" operation -- and the reason why ALTREP strings are used so frequently -- is turning a C string (a series of bytes) into an R string, because it involves checking the global string pool to see whether that string already exists and, if it does not, inserting it.
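   
   A rough sketch of that asymmetry (assuming the `bench` package and some made-up data, not your actual workload): combining vectors of existing R strings mostly copies pointers, while pulling the same values back out of Arrow storage forces each one through the global string pool.
   
   ```r
   library(nanoarrow)
   
   # hypothetical data: ten identical batches of 100,000 strings
   r_strings <- sprintf("value_%d", seq_len(1e5))
   batches <- rep(list(r_strings), 10)
   arrays <- lapply(batches, as_nanoarrow_array)
   
   # assumes the bench package is installed
   bench::mark(
     # existing R strings: the underlying CHARSXPs already exist,
     # so this mostly copies pointers
     bind = unlist(batches, use.names = FALSE),
     # Arrow -> R: every value has to be looked up in (or inserted into)
     # R's global string pool
     convert = unlist(lapply(arrays, convert_array), use.names = FALSE)
   )
   ```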
   
   ```
   # single stream (instant)
   ```
   
   The reason this is instantaneous is that it is essentially just wrapping the ArrowArray in an R object via ALTREP. This is very good if you're pretty sure you won't need all of those strings; it's somewhat slower overall if you know you are going to turn all of those values into R strings anyway. To get a sense of how long it takes to materialize all of those strings into R strings, you can run `nanoarrow:::nanoarrow_altrep_force_materialize(single_converted, recursive = TRUE)`. For me that took about 4 seconds.
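   
   For reference, that materialization step on `single_converted` can be timed directly (the same call as above, just wrapped in `system.time()`):
   
   ```r
   # forces the deferred ALTREP strings into regular R strings; the ~4
   # seconds mentioned above is the runtime of this call on my machine
   system.time(
     nanoarrow:::nanoarrow_altrep_force_materialize(single_converted, recursive = TRUE)
   )
   ```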
   
   ```
   # batched streams (3 seconds)
   ```
   
   The way that `convert_array_stream()` currently handles this is to (1) 
collect all of the arrays into memory, (2) allocate one big `character()` vector, 
and (3) fill it in batch by batch. I would expect this to take the exact same 
amount of time as 
`nanoarrow:::nanoarrow_altrep_force_materialize(single_converted, recursive = 
TRUE)` since I think it uses the same code to materialize the strings into R 
land.
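   
   In rough R-level terms (this is only a sketch; the real filling happens in C inside `convert_array_stream()`, and the function name here is made up), that strategy looks something like this:
   
   ```r
   library(nanoarrow)
   
   # `arrays` is assumed to be the list of arrays already collected from
   # the stream (step 1 above)
   convert_string_batches <- function(arrays) {
     chunks <- lapply(arrays, convert_array)
   
     # (2) allocate one big character() vector for the full result
     out <- character(sum(lengths(chunks)))
   
     # (3) fill it in batch by batch; this is where every string ends up
     # in R's global string pool
     offset <- 0L
     for (chunk in chunks) {
       out[offset + seq_along(chunk)] <- chunk
       offset <- offset + length(chunk)
     }
     out
   }
   ```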
   
   ```
   # diy convert (batch conversion is super fast, but the bind_rows now takes 3 
seconds)
   ```
   
   Here, each individual batch is very fast because each individual 
`character()` vector is converted via ALTREP, which does not attempt to insert 
any strings into the R global string pool. When `bind_rows()` is called, all of 
those strings get inserted into the R global string pool in pretty much the 
exact same way as they would have been in your `# batched streams` example.
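   
   If you want to confirm that the conversion work was deferred rather than avoided (a sketch, assuming `tmp` is your list of per-batch tibbles), forcing the materialization explicitly should take roughly the same ~3 seconds as the `# batched streams` case:
   
   ```r
   # materialize the deferred strings in each per-batch tibble; after this,
   # bind_rows() is back to copying pointers to existing strings
   system.time(
     lapply(tmp, nanoarrow:::nanoarrow_altrep_force_materialize, recursive = TRUE)
   )
   ```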
   
   ```
   # create deep copy of the data frame list, where the vectors inside each 
tibble are deep copies
   # then bind_rows becomes fast again (median 200 ms)
   ```
   
   This is not an accurate representation: your `bind_rows()` call above resulted in all of the values in `tmp` being "materialized": they are no longer deferred ALTREP conversions, just thin wrappers around a regular R character vector. This is why your performance appears to improve; however, it's just because you're dealing with regular R vectors containing strings that have already been inserted into the global R string pool.
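   
   A quick way to convince yourself of this without the deep copy is to time the same `bind_rows()` call twice in a row (again assuming `tmp` from your example): the first call pays the string-pool insertion cost, and the second is fast because the strings are already materialized.
   
   ```r
   system.time(dplyr::bind_rows(tmp))  # pays the materialization cost
   system.time(dplyr::bind_rows(tmp))  # strings already in the pool: fast
   ```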
   
   I hope that helps!

