klin333 commented on PR #823:
URL: https://github.com/apache/arrow-nanoarrow/pull/823#issuecomment-3499871805

   ohhhhhh. i think i understand nanoarrow a bit better now.
   
   I believe `convert_array_stream` currently doesn't use string ALTREP at all. 
Only `convert_array` uses ALTREP. In the PR change, by converting the collected 
array batches, I've actually changed **to** using ALTREP, but only for each 
converted batch. As soon as I vec_rbind the converted batches, ALTREP strings 
get materialized, and the resulting single data.frame does not use ALTREP.
   
   However, even though vec_rbind forces materialization of ALTREP strings, ie 
negating the benefits of ALTREP, the total convert array batches path is still 
good. 
   
   Essentially, the PR convert array batches path is not great, but good 
enough. The convert_array_stream(basic_stream) path is just extremely bad with 
exponential time behaviour. Ie this PR qualifies only as a workaround, as 
opposed to root fix. convert_array_stream(basic_stream) should not be taking 
exponential time, and this PR does nothing to fix it.
   
   Perhaps it's to do with R global string pool? Perhaps by using ALTREP in 
converted array batches, and concatenating n_batches of them, the global string 
pool only gets altered (n_batch * ncol) times? I have no idea how the 
convert_array_stream(basic_stream) path works, perhaps it results in the R 
global string pool being altered (n_rows * ncols) times?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to