westonpace edited a comment on pull request #11730:
URL: https://github.com/apache/arrow/pull/11730#issuecomment-1016058691


   I dug into this a little bit today.  I used Dewey's latest reprex, but for the read side I just used this (to avoid carrow at the moment):
   
   ```
     library(arrow)
     library(dplyr)

     # `tf` is the path written earlier in Dewey's reprex
     x <- open_dataset(tf) %>%
       to_duckdb(auto_disconnect = FALSE) %>%  # register the dataset with duckdb
       to_arrow() %>%                          # stream the duckdb result back into arrow
       dplyr::collect()
     print(x, max_footer_lines = 100, n = 100)
   ```
   
   Warning: the following is a barrage of facts, with a smattering of assumptions/conclusions at the end.
   
   I found a few oddities.  In that one query I expected to see two exec plans get created.  The first plan sources from files and sinks into DuckDB (crossing the bridge into DuckDB).  The second plan sources from DuckDB (crossing the bridge back into Arrow) and sinks into R.
   
   In reality, five exec plans were created.  Four of these came from calls to `RArrowTabularStreamFactory::Produce` (DuckDB code).  All four calls to `Produce` had the same `factory_p`.  All four of these yield 4 batches (as expected), although the first three scans don't appear to be consumed.  The first three calls had an empty `project_columns` and a null `filters`.  The fourth call had a valid `project_columns` and a valid `filters`.
   
   Across the run, 12 instances of `ExportedArrayStream` are created.  Only 4 of these are released.  I don't *think* this is a problem, but figured I'd record it.  I believe the private data here is a shared_ptr, so these could just be copies, but for something that has a `Release` method I expected parity between creations and releases.  Each of the first three calls to `RArrowTabularStreamFactory::Produce` generates 2 `ExportedArrayStream` instances and releases 1.  None of the `ExportedArrayStream`s created by these calls ever yields any actual arrays.
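
   For reference, the struct behind those `ExportedArrayStream` instances is the Arrow C stream interface struct (reproduced here from memory of `arrow/c/abi.h`, so double-check the header rather than trusting this verbatim).  The `release` callback is the one I'm counting, and `private_data` is where I believe the shared_ptr lives:

   ```
   /* Arrow C stream interface (arrow/c/abi.h).  ExportedArrayStream is the
      producer-side wrapper Arrow creates when it exports a reader. */
   struct ArrowArrayStream {
     /* Get the schema of the stream (0 on success, errno-style code on error). */
     int (*get_schema)(struct ArrowArrayStream*, struct ArrowSchema* out);
     /* Get the next array; a released `out` (release == NULL) signals end of stream. */
     int (*get_next)(struct ArrowArrayStream*, struct ArrowArray* out);
     /* Error message for the last failed call. */
     const char* (*get_last_error)(struct ArrowArrayStream*);
     /* The consumer calls this exactly once when it is done with the stream. */
     void (*release)(struct ArrowArrayStream*);
     /* Producer-private state (where the shared_ptr mentioned above would live). */
     void* private_data;
   };
   ```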
   
   The final call to `RArrowTabularStreamFactory::Produce` generates 6 
instances of `ExportedArrayStream` and these generate 9 arrays each.  All of 
the arrays are released (although rather quickly).  So there are a total of 36 
calls to send an array from Arrow to DuckDB and all 36 are released.
   
   All 36 of these arrays seem valid and appear to contain the correct data.
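
   For anyone following along, each of those "sends" is an `ArrowArray` crossing the C data interface (again paraphrased from `arrow/c/abi.h`).  The important part of the contract for what follows: the producer owns the buffers, and they are only guaranteed to stay valid until the consumer invokes `release`.

   ```
   /* Arrow C data interface (arrow/c/abi.h): what actually crosses the bridge. */
   struct ArrowArray {
     int64_t length;
     int64_t null_count;
     int64_t offset;
     int64_t n_buffers;
     int64_t n_children;
     /* Pointers into producer-owned memory; valid only until `release` is called. */
     const void** buffers;
     struct ArrowArray** children;
     struct ArrowArray* dictionary;
     /* Consumer calls this once it no longer needs the data (including buffers). */
     void (*release)(struct ArrowArray*);
     /* Producer-private state used by `release` to free the underlying memory. */
     void* private_data;
   };
   ```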
   
   The final exec plan (which reads from DuckDB's bridge and converts back to R) yields four batches.  Some of these batches are repeated and/or contain garbage data.
   
   **My theory is that the arrays are being released by DuckDB before the R side is finished with them.**
   
   I did a bit more investigation to try to come up with some conclusive evidence in this regard.
   
   If I put a breakpoint on Arrow's release method, I find that the array data is being released at `duckdb/src/function/table/arrow.cpp:1078`, which seems incorrect.  Basically, DuckDB holds onto a batch until the next batch is read in and then discards the old one.
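
   If anyone wants to reproduce that observation without sitting in a debugger, here is the shape of the instrumentation (a hypothetical `InstrumentArray` helper, not code that exists in Arrow or DuckDB): intercept the exported array's `release` so every call gets logged before delegating to the real producer callback.  This trick is safe for `ArrowArray` because `private_data` is only ever read by the producer's own `release`.

   ```
   #include <arrow/c/abi.h>
   #include <cstdio>

   /* Hypothetical debugging shim: log exactly when (and for which array) the
      producer-side buffers get torn down. */
   struct ReleaseShim {
     void (*original_release)(struct ArrowArray*);
     void* original_private_data;
   };

   static void LoggingRelease(struct ArrowArray* array) {
     auto* shim = static_cast<ReleaseShim*>(array->private_data);
     std::fprintf(stderr, "releasing ArrowArray %p (length=%lld, n_buffers=%lld)\n",
                  static_cast<void*>(array),
                  static_cast<long long>(array->length),
                  static_cast<long long>(array->n_buffers));
     /* Restore the producer's callback/state, then let it do the real release. */
     array->release = shim->original_release;
     array->private_data = shim->original_private_data;
     array->release(array);
     delete shim;
   }

   /* Call right after the array is exported (e.g. after ExportRecordBatch). */
   void InstrumentArray(struct ArrowArray* array) {
     auto* shim = new ReleaseShim{array->release, array->private_data};
     array->private_data = shim;
     array->release = LoggingRelease;
   }
   ```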
   
   One possible theory is that DuckDB is making a copy of the array data, in which case it would be OK to release the array data on the Arrow side once that copy is made.  However, if I look at the buffer pointers that are exported (from Arrow) and the buffer pointers that are imported (from DuckDB), many of them match.  If this is a "use after free" scenario, it makes sense that not all buffer pointers are overwritten.  However, if this were a "DuckDB makes a copy" scenario, then I would expect all of the buffer pointers to be distinct.  Furthermore, the buffers imported from DuckDB do not appear valid.  There are more NULLs than I would expect, and some of them are pointers to parts of memory that are not on the heap.
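
   And a sketch of the pointer comparison itself (hypothetical helper names; the idea is just to record the buffer addresses Arrow hands to DuckDB and see whether the same addresses come back when the second plan imports from DuckDB).  If DuckDB were copying, none of these addresses should match; since many do match, this looks like zero-copy data whose producer-side owner has already been released.

   ```
   #include <arrow/c/abi.h>
   #include <cstdio>
   #include <set>

   /* Hypothetical bookkeeping for the export/import pointer comparison. */
   static std::set<const void*> exported_buffers;

   /* Record every buffer address in an array we export to DuckDB. */
   void RecordExportedBuffers(const struct ArrowArray* array) {
     for (int64_t i = 0; i < array->n_buffers; ++i) {
       if (array->buffers[i] != nullptr) exported_buffers.insert(array->buffers[i]);
     }
     for (int64_t i = 0; i < array->n_children; ++i) {
       RecordExportedBuffers(array->children[i]);
     }
   }

   /* Check whether buffers imported back from DuckDB are the same pointers. */
   void CheckImportedBuffers(const struct ArrowArray* array) {
     for (int64_t i = 0; i < array->n_buffers; ++i) {
       const void* buf = array->buffers[i];
       if (buf != nullptr && exported_buffers.count(buf) > 0) {
         /* Same address we exported earlier: not a copy, so releasing the
            exported array early leaves this buffer dangling. */
         std::fprintf(stderr, "imported buffer %p was originally exported by Arrow\n",
                      const_cast<void*>(buf));
       }
     }
     for (int64_t i = 0; i < array->n_children; ++i) {
       CheckImportedBuffers(array->children[i]);
     }
   }
   ```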
   
   

