paleolimbot commented on issue #219: URL: https://github.com/apache/arrow-nanoarrow/issues/219#issuecomment-1924035325
I'm not sure I can solve your problem; however, I'll explain what I *think* is happening and perhaps that will help.

```
# bind_rows on list of tibbles is very fast, median time 200ms. rbind is hopeless 4 seconds.
```

The reason that `bind_rows()` is faster than `convert_array_stream()` here is that the strings are already R strings. R has a global string pool and all of the strings are already R objects, so the "bind" operation is "just" copying pointers to existing strings. The "slow" operation -- and the reason why ALTREP strings are used so frequently -- is turning a C string (a series of bytes) into an R string, because that involves checking the global string pool to see if the string already exists and, if it does not, inserting it.

```
# single stream (instant)
```

This is instantaneous because it is functionally just wrapping the ArrowArray in an R object via ALTREP. That's very good if you're pretty sure you won't need all of those strings; it's somewhat slower if you know that you are going to turn all of them into R strings anyway. To get a sense of how long it takes to materialize all of those strings into R strings, you can run `nanoarrow:::nanoarrow_altrep_force_materialize(single_converted, recursive = TRUE)`. For me that took about 4 seconds.

```
# batched streams (3 seconds)
```

The way that `convert_array_stream()` currently handles this is to (1) collect all of the arrays into memory, (2) allocate one big long `character()`, and (3) fill it in batch by batch. I would expect this to take the exact same amount of time as `nanoarrow:::nanoarrow_altrep_force_materialize(single_converted, recursive = TRUE)`, since I think it uses the same code to materialize the strings into R land.

```
# diy convert (batch conversion is super fast, but the bind_rows now takes 3 seconds)
```

Here, each individual batch converts very quickly because each individual `character()` vector is converted via ALTREP, which does not attempt to insert any strings into R's global string pool. When `bind_rows()` is called, all of those strings get inserted into the global string pool in pretty much the same way as they would have been in your `# batched streams` example.

```
# create deep copy of the data frame list, where the vectors inside each tibble are deep copies
# then bind_rows becomes fast again (median 200 ms)
```

This is not an accurate representation: your `bind_rows()` call above resulted in all of the values in `tmp` being "materialized": they are no longer lazy ALTREP vectors, just thin wrappers around regular R character vectors. This is why your performance appears to improve; however, it's just because you're dealing with regular R vectors containing strings that have already been inserted into the global string pool.

I hope that helps!
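For reference, here is a minimal sketch of how you could observe the deferred materialization cost yourself. It is not the original benchmark: it builds a small throwaway stream with `as_nanoarrow_array_stream()` and assumes the string column comes back as a deferred ALTREP vector.

```r
library(nanoarrow)

# Small stand-in for the original data: one string-heavy column.
# (Hypothetical; the real benchmark used a much larger stream.)
df <- data.frame(x = as.character(sample(1e6)), stringsAsFactors = FALSE)
stream <- as_nanoarrow_array_stream(df)

# Conversion itself should be near-instant: the string column is (assumed
# to be) an ALTREP wrapper around the Arrow data, so no new R strings are
# created yet.
system.time(single_converted <- convert_array_stream(stream))

# Forcing materialization inserts every string into R's global string pool;
# this is where the multi-second cost described above comes from.
system.time(
  nanoarrow:::nanoarrow_altrep_force_materialize(single_converted, recursive = TRUE)
)
```

On the real data, the second timing corresponds to the ~4 seconds mentioned above, and it is the same work that `bind_rows()` ends up triggering in the DIY-convert case.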
