klin333 opened a new pull request, #823:
URL: https://github.com/apache/arrow-nanoarrow/pull/823

   As noted in #822, nanoarrow_array_stream -> data.frame conversion is 
extremely slow and its run time grows near exponentially with the number of 
columns, probably due to extremely inefficient string ALTREP materialization.
   
   Based on the reprex in that issue, the original elapsed times are:
   | num_cols | elapsed_with_arrow | elapsed_without_arrow |
   |----------:|-------------------:|----------------------:|
   | 10        | 2.4 secs      | 1.3 secs         |
   | 20        | 3.2 secs      | 2.9 secs         |
   | 40        | 6.1 secs      | 7.0 secs         |
   | 80        | 12.7 secs     | 30.8 secs        |
   | 160      | 26.9 secs       | **920.7 secs**          |
   
   After this fix, the elapsed times are:
   | num_cols | elapsed_with_arrow | elapsed_without_arrow |
   |-----------|--------------------|------------------------|
   | 10        | 2.4 secs      | 1.3 secs          |
   | 20        | 3.2 secs      | 2.5 secs          |
   | 40        | 6.3 secs      | 5.3 secs          |
   | 80        | 14.2 secs     | 12.4 secs         |
   | 160       | 27.3 secs     | **25.6 secs**         |
   
   This slowness in convert_array_stream was previously noted in a comment in 
#219. 
   
   The intuition behind this PR is that recreating a single array_stream from 
the already collected batches of arrays is probably very inefficient. I don't 
fully understand why we previously did not convert the collected arrays to 
data.frame directly, given that the function already exists. Maybe it's because 
we don't want to bind batches of converted data.frames in R?
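
   To illustrate the idea only (this is not the code changed in this PR), here 
is a minimal user-level sketch of converting each collected batch directly and 
binding the results in R. The toy `batch_df` data is hypothetical, and the use 
of `rbind` to combine batches is an assumption made for illustration:

```r
library(nanoarrow)

# Hypothetical toy data: two identical batches with a string column.
batch_df <- data.frame(
  id = 1:3,
  label = c("a", "b", "c"),
  stringsAsFactors = FALSE
)
stream <- basic_array_stream(
  list(as_nanoarrow_array(batch_df), as_nanoarrow_array(batch_df))
)

# Collect the stream into a list of nanoarrow_array batches.
batches <- collect_array_stream(stream)

# Convert each batch to a data.frame directly, rather than rebuilding a
# single stream from the collected batches and converting that.
converted <- lapply(batches, convert_array)

# Bind the per-batch data.frames in R (assumption: rbind is acceptable here).
result <- do.call(rbind, converted)
```

   The trade-off is the one raised above: the per-batch data.frames have to be 
bound in R, which costs an extra copy of each column.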

