klin333 commented on issue #822:
URL: 
https://github.com/apache/arrow-nanoarrow/issues/822#issuecomment-3514835708

   Yep, further diagnosis confirmed that it's gc() after restreaming in R that 
hangs, for a time that grows exponentially as the number of unique strings goes up. 
   
   In the snippet below, I can trigger the 900+ second hang (with 100k rows 
and 160 columns of unique strings) by simply calling gc(). No 
nanoarrow_c_convert_array_stream call is made at all; it's just restreaming in R. 
   
   This seems consistent with my earlier observation that convert_array_stream 
itself returns in linear time, but any subsequent R command then hangs for 
exponential time. That is probably when gc() gets triggered, and R gets 
crippled trying to collect the `batches` and `basic_stream` objects from the 
local stack of convert_array_stream().
   
   ```r
   batches <- collect_array_stream(
     array_stream,
     n,
     schema = schema,
     validate = FALSE
   )
   basic_stream <- .Call(nanoarrow_c_basic_array_stream, batches, schema, FALSE)
   
   # Any of these will hang in exponential time as the number of unique
   # strings increases
   
   # 1)
   rm(batches)
   gc() # hangs
   
   # 2)
   rm(basic_stream)
   gc() # hangs
   
   # 3)
   rm(batches, basic_stream)
   gc() # hangs
   ```
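
   A minimal sketch of how the scaling could be measured outside of 
convert_array_stream(), for anyone who wants to check it. This is not the 
exact code I ran. It assumes the exported `basic_array_stream()` goes through 
the same code path as the `.Call()` above, and the helper name 
`time_gc_after_restream` and the sizes are made up for illustration. It may or 
may not reproduce the hang at these small sizes.

   ```r
   # Hypothetical helper (not from nanoarrow): build a data frame with n_col
   # columns of unique strings, restream it, drop the references, and time gc().
   time_gc_after_restream <- function(n_row, n_col) {
     cols <- lapply(seq_len(n_col), function(j) {
       paste0("c", j, "_", seq_len(n_row))
     })
     names(cols) <- paste0("col", seq_len(n_col))
     df <- as.data.frame(cols, stringsAsFactors = FALSE)

     stream <- nanoarrow::as_nanoarrow_array_stream(df)
     schema <- stream$get_schema()
     batches <- nanoarrow::collect_array_stream(
       stream,
       schema = schema,
       validate = FALSE
     )
     # Assumption: the exported wrapper stands in for the internal .Call()
     basic_stream <- nanoarrow::basic_array_stream(
       batches,
       schema = schema,
       validate = FALSE
     )

     rm(batches, basic_stream)
     system.time(gc())[["elapsed"]]
   }

   # Small, arbitrary sizes so this finishes quickly; scale up with care.
   if (requireNamespace("nanoarrow", quietly = TRUE)) {
     for (n_row in c(1000, 2000, 4000)) {
       elapsed <- time_gc_after_restream(n_row, n_col = 10)
       cat(n_row, "rows:", elapsed, "seconds in gc()\n")
     }
   }
   ```

   If the gc() time roughly doubles each time the row count doubles, the cost 
is linear; growth much faster than that would match what I'm seeing.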


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

