klin333 opened a new pull request, #823: URL: https://github.com/apache/arrow-nanoarrow/pull/823
As noted in #822, nanoarrow_array_stream -> data.frame conversion is extremely slow and suffers near-exponential time scaling, probably due to extremely inefficient string ALTREP materialization.

Based on the reprex in that issue, the original timings are:

| num_cols | elapsed_with_arrow | elapsed_without_arrow |
|---------:|-------------------:|----------------------:|
| 10 | 2.4 secs | 1.3 secs |
| 20 | 3.2 secs | 2.9 secs |
| 40 | 6.1 secs | 7.0 secs |
| 80 | 12.7 secs | 30.8 secs |
| 160 | 26.9 secs | **920.7 secs** |

After this fix, the elapsed times are:

| num_cols | elapsed_with_arrow | elapsed_without_arrow |
|---------:|-------------------:|----------------------:|
| 10 | 2.4 secs | 1.3 secs |
| 20 | 3.2 secs | 2.5 secs |
| 40 | 6.3 secs | 5.3 secs |
| 80 | 14.2 secs | 12.4 secs |
| 160 | 27.3 secs | **25.6 secs** |

This slowness in convert_array_stream was previously noted in a comment in #219.

The intuition behind this PR is that recreating a single array_stream from the already collected batches of arrays is probably very inefficient. I don't fully understand why we did not previously convert the collected arrays directly to a data.frame, given that the function already exists. Maybe it's because we don't want to bind batches of converted data.frames in R?
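As a quick sanity check on the scaling claim, the sketch below computes how much the elapsed_without_arrow time grows each time num_cols doubles, using only the timings quoted above. It is written in Python purely for the arithmetic and is not part of the PR; the helper name `doubling_ratios` is made up for illustration. Under linear scaling, each doubling of columns should roughly double the time.

```python
# Elapsed seconds without arrow loaded, copied from the timings quoted above.
num_cols = [10, 20, 40, 80, 160]
before = [1.3, 2.9, 7.0, 30.8, 920.7]  # original timings
after = [1.3, 2.5, 5.3, 12.4, 25.6]    # timings after the fix


def doubling_ratios(times):
    """Return times[i + 1] / times[i] for each consecutive pair (hypothetical helper)."""
    return [later / earlier for earlier, later in zip(times, times[1:])]


before_ratios = doubling_ratios(before)
after_ratios = doubling_ratios(after)

# One line per doubling of the column count.
for cols, b, a in zip(num_cols[1:], before_ratios, after_ratios):
    print(f"{cols // 2:>3} -> {cols:>3} cols: before x{b:.1f}, after x{a:.1f}")

# Overall improvement at the widest table in the benchmark.
print(f"speedup at 160 cols: x{before[-1] / after[-1]:.1f}")
```

With these numbers, the growth factor before the fix climbs with every doubling and reaches roughly 30x for the last step (80 to 160 columns), whereas after the fix it stays close to 2x throughout, which is what linear scaling in the number of columns would look like.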
