klin333 commented on PR #823: URL: https://github.com/apache/arrow-nanoarrow/pull/823#issuecomment-3499871805
ohhhhhh. i think i understand nanoarrow a bit better now. I believe `convert_array_stream` currently doesn't use string ALTREP at all. Only `convert_array` uses ALTREP. In the PR change, by converting the collected array batches, I've actually changed **to** using ALTREP, but only for each converted batch. As soon as I vec_rbind the converted batches, ALTREP strings get materialized, and the resulting single data.frame does not use ALTREP. However, even though vec_rbind forces materialization of ALTREP strings, ie negating the benefits of ALTREP, the total convert array batches path is still good. Essentially, the PR convert array batches path is not great, but good enough. The convert_array_stream(basic_stream) path is just extremely bad with exponential time behaviour. Ie this PR qualifies only as a workaround, as opposed to root fix. convert_array_stream(basic_stream) should not be taking exponential time, and this PR does nothing to fix it. Perhaps it's to do with R global string pool? Perhaps by using ALTREP in converted array batches, and concatenating n_batches of them, the global string pool only gets altered (n_batch * ncol) times? I have no idea how the convert_array_stream(basic_stream) path works, perhaps it results in the R global string pool being altered (n_rows * ncols) times? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
