klin333 commented on issue #822: URL: https://github.com/apache/arrow-nanoarrow/issues/822#issuecomment-3514375809
Ok, I had an epiphany. I am using snowflake adbc multi-batch array stream of lots of unique strings. Normally convert_array_stream() will result in exponential time hanging (the call itself finishes fine, but the immediate next arbitrary line will have exponential hang). But, by supplying convert_array_stream() with known size, the exponential time problem is gone. See illustrative example below. The exact same multi-batch snowflake adbc stream. Only difference is supplying the known size. This means, the inefficiency is in the collect_array_stream() and nanoarrow_c_basic_array_stream() parts, ie restreaming in R. ChatGPT reasons that it is likely due to the recursive array_export() in nanoarrow_c_basic_array_stream(), resulting in large number of crossing the C and R boundary, that eventually cripples R garbage collection, and/or global string pool. The logical solution seems to be: calculate the array_stream total length, and restream in C. This massively reduces the C/R boundary crossing. I've done it here in a fork: https://github.com/apache/arrow-nanoarrow/compare/main...klin333:arrow-nanoarrow:feature-r-restream The C function nanoarrow_c_array_stream_total_length is totally written by ChatGPT, so if this is the right way to go, heavy code review is required. On the bright side, this proposed fork totally fixes the exponential time behaviour observed in my original post. And it does so without my previous workaround of using convert_array. This directly fixes the problem with convert_array_stream. ```r uri <- "secret" adbc_con <- adbcsnowflake::adbcsnowflake() |> adbcdrivermanager::adbc_database_init(uri = uri) |> adbcdrivermanager::adbc_connection_init() t1 <- Sys.time() stream <- adbcdrivermanager::read_adbc(adbc_con, 'SELECT * from "LARGE160"') x <- stream to <- nanoarrow::infer_nanoarrow_ptype(x$get_schema()) df <- nanoarrow::convert_array_stream(x, to) # setting size = 1e5 (known value) here fixes exponential time problem .Internal(inspect(df[[1]])) # note weirdly, without size argument, it's this line (any command) after convert_array_stream that hangs df <- tibble::as_tibble(df) .Internal(inspect(df[[1]])) gc() print(df) print(Sys.time() - t1) ``` Same test from original post, after proposed restream in C: | num_cols | elapsed_with_arrow | elapsed_without_arrow | |-----------|--------------------|------------------------| | 10 | 1.7 secs | 1.6 secs | | 20 | 4.4 secs | 2.9 secs | | 40 | 7.1 secs | 6.4 secs | | 80 | 12.9 secs | 15.7 secs | | 160 | 28.4 secs | **27.2 secs** | -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
