klin333 commented on issue #822:
URL: https://github.com/apache/arrow-nanoarrow/issues/822#issuecomment-3514375809

   Ok, I had an epiphany. I am using a Snowflake ADBC multi-batch array stream containing many unique strings. Normally, convert_array_stream() results in an exponential-time hang (the call itself finishes fine, but whatever line runs immediately afterwards hangs for an exponentially long time).
   
   But by supplying convert_array_stream() with a known size, the exponential-time problem is gone. See the illustrative example below: it uses the exact same multi-batch Snowflake ADBC stream, with the only difference being the supplied known size.
   
   This means the inefficiency is in the collect_array_stream() and nanoarrow_c_basic_array_stream() parts, i.e. re-streaming in R. ChatGPT reasons that it is likely due to the recursive array_export() in nanoarrow_c_basic_array_stream(), which results in a large number of crossings of the C/R boundary that eventually cripple R garbage collection and/or the global string pool.
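   
   To make that concrete, here is a hedged sketch of the suspected slow path. collect_array_stream() and basic_array_stream() are the exported R equivalents of the internals named above, so if this diagnosis is right, driving them by hand on the same stream should show the same hang:
   
   ```r
   # Hedged sketch of the suspected slow path (assumes a fresh `stream`
   # from read_adbc()): collect every batch into R, then re-stream the
   # collected batches back through C. Each batch makes round trips
   # across the C/R boundary.
   batches <- nanoarrow::collect_array_stream(stream)
   restream <- nanoarrow::basic_array_stream(batches)
   df <- nanoarrow::convert_array_stream(restream)
   ```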
   
   The logical solution seems to be: calculate the array stream's total length and re-stream in C. This massively reduces the C/R boundary crossings. I've done it here in a fork: https://github.com/apache/arrow-nanoarrow/compare/main...klin333:arrow-nanoarrow:feature-r-restream
   The C function nanoarrow_c_array_stream_total_length was written entirely by ChatGPT, so if this is the right way to go, heavy code review is required.
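   
   As an R-level illustration only (not the fork itself, which does the counting and re-streaming in C, and which is the whole point), the idea is:
   
   ```r
   # R-level equivalent of the fork's logic: sum the per-batch lengths,
   # then pass the total as `size` so convert_array_stream() can
   # preallocate instead of re-streaming. Illustrative only: the
   # collection here still happens in R; the fork moves it into C.
   batches <- nanoarrow::collect_array_stream(stream)
   total_len <- sum(vapply(batches, function(b) as.numeric(b$length), numeric(1)))
   restream <- nanoarrow::basic_array_stream(batches)
   df <- nanoarrow::convert_array_stream(restream, size = total_len)
   ```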
   
   On the bright side, this proposed fork completely fixes the exponential-time behaviour observed in my original post, and it does so without my previous workaround of using convert_array. It directly fixes the problem with convert_array_stream.
   
   ```r
   uri <- "secret"
   adbc_con <- adbcsnowflake::adbcsnowflake() |>
     adbcdrivermanager::adbc_database_init(uri = uri) |>
     adbcdrivermanager::adbc_connection_init()
   
   t1 <- Sys.time()
   stream <- adbcdrivermanager::read_adbc(adbc_con, 'SELECT * from "LARGE160"')
   
   x <- stream
   to <- nanoarrow::infer_nanoarrow_ptype(x$get_schema())
   df <- nanoarrow::convert_array_stream(x, to)  # setting size = 1e5 (known value) here fixes the exponential time problem
   .Internal(inspect(df[[1]]))  # note: weirdly, without the size argument, it's this line (any command) after convert_array_stream that hangs
   df <- tibble::as_tibble(df)
   .Internal(inspect(df[[1]]))
   gc()
   print(df)
   print(Sys.time() - t1)
   ```
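   
   For reference, the one-line change flagged in the comments above (1e5 being the known row count of this particular table):
   
   ```r
   # Supplying the known total row count up front skips the R-level
   # re-streaming path entirely.
   df <- nanoarrow::convert_array_stream(x, to, size = 1e5)
   ```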
   
   Same test from original post, after proposed restream in C:
   | num_cols | elapsed_with_arrow | elapsed_without_arrow |
   |----------|--------------------|-----------------------|
   | 10       | 1.7 secs           | 1.6 secs              |
   | 20       | 4.4 secs           | 2.9 secs              |
   | 40       | 7.1 secs           | 6.4 secs              |
   | 80       | 12.9 secs          | 15.7 secs             |
   | 160      | 28.4 secs          | **27.2 secs**         |
   

