westonpace edited a comment on pull request #11730:
URL: https://github.com/apache/arrow/pull/11730#issuecomment-1016058691
I dug into this a little bit today. I used Dewey's latest reprex, but for
reading I just used this (to avoid carrow at the moment):
```
x <- open_dataset(tf) %>%
  to_duckdb(auto_disconnect = FALSE) %>%
  to_arrow() %>%
  dplyr::collect()
print(x, max_footer_lines = 100, n = 100)
```
Warning: the following is a barrage of facts, with a smattering of
assumptions/conclusions at the end.
I found a few oddities. For that one query I expected two exec plans to be
created: the first sources from files and sinks into DuckDB (crossing the
bridge into DuckDB); the second sources from DuckDB (crossing the bridge back
into Arrow) and sinks into R.
In reality, five exec plans were created. Four of these came from calls to
`RArrowTabularStreamFactory::Produce` (duckdb code). All four calls to
`Produce` had the same `factory_p`, and each yields 4 batches (as expected),
although the first three scans don't appear to be consumed. The first three
calls had an empty `project_columns` and a null `filters`; the fourth call had
a valid `project_columns` and a valid `filters`.
Across the run, 12 instances of `ExportedArrayStream` are created but only 4
of them are released. I don't *think* this is a problem, but figured I'd
record it. I believe the private data here is a `shared_ptr`, so these could
just be copies, but for something that has a `Release` method I expected
parity between creation and `Release`. Each of the first three calls to
`RArrowTabularStreamFactory::Produce` generates 2 `ExportedArrayStream`
instances and releases 1. None of the `ExportedArrayStream` instances created
by these calls ever yields any actual arrays.
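For reference, this is the C stream interface struct that
`ExportedArrayStream` backs, as defined in Arrow's `arrow/c/abi.h`. The spec
says the consumer must call `release` exactly once on each stream it is
handed, which is why I expected parity:
```
// From arrow/c/abi.h: the C stream interface. Comments paraphrase the spec.
struct ArrowArrayStream {
  // Get the schema of the stream's batches; returns 0 on success, else an
  // errno-compatible error code.
  int (*get_schema)(struct ArrowArrayStream* stream, struct ArrowSchema* out);
  // Get the next batch; signals end-of-stream by returning 0 with a released
  // array (out->release == NULL).
  int (*get_next)(struct ArrowArrayStream* stream, struct ArrowArray* out);
  // Description of the last error, if any.
  const char* (*get_last_error)(struct ArrowArrayStream* stream);
  // The consumer must call this exactly once when done with the stream.
  void (*release)(struct ArrowArrayStream* stream);
  // Opaque producer data (here, I believe, the shared_ptr noted above).
  void* private_data;
};
```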
The final call to `RArrowTabularStreamFactory::Produce` generates 6
instances of `ExportedArrayStream` and these generate 9 arrays each. All of
the arrays are released (though rather quickly). So there are a total of 36
calls to send an array from Arrow to DuckDB, and all 36 are released.
All 36 of these arrays appear valid and contain the correct data.
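The per-batch struct has the same ownership convention: each `ArrowArray`
handed across the bridge carries its own `release` callback (again from
`arrow/c/abi.h`):
```
// From arrow/c/abi.h: a single exported batch. The buffers stay valid only
// until the consumer invokes `release`.
struct ArrowArray {
  int64_t length;
  int64_t null_count;
  int64_t offset;
  int64_t n_buffers;
  int64_t n_children;
  const void** buffers;    // raw buffer pointers (validity, data, ...)
  struct ArrowArray** children;
  struct ArrowArray* dictionary;
  // Called exactly once by the consumer when it no longer needs the buffers.
  void (*release)(struct ArrowArray*);
  void* private_data;
};
```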
The final exec plan (which reads from DuckDB's side of the bridge and converts
back to R) yields four batches. Some of these batches are repeated and/or
contain garbage data.
**My theory is that the arrays are being released by DuckDB before the R side
is finished with them.**
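To make the suspected failure mode concrete, here is a minimal sketch against
the C data interface (the flow is an assumption for illustration, not actual
duckdb or arrow code):
```
#include <arrow/c/abi.h>  // ArrowArray / ArrowArrayStream definitions

// Illustrative only: a consumer releases a batch while downstream code still
// holds raw pointers into its buffers.
void use_after_free_sketch(struct ArrowArrayStream* stream) {
  struct ArrowArray batch;
  stream->get_next(stream, &batch);  // ownership of `batch` moves to us

  // Downstream code (here, the R side of the second exec plan) keeps a raw
  // pointer into the batch instead of copying (assumes n_buffers >= 2).
  const void* values = batch.buffers[1];

  // Releasing now lets the producer reclaim the buffers; any later read
  // through `values` is a use-after-free, which would explain the repeated
  // and garbage batches.
  batch.release(&batch);
  (void)values;  // ...`values` now dangles...
}
```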
I did a bit more investigation to try to come up with some conclusive
evidence to that effect.
If I put a breakpoint on Arrow's release method, I find that the array data
is being released at `duckdb/src/function/table/arrow.cpp:1078`, which seems
incorrect. Basically, DuckDB holds onto the batch until the next batch is
read in and then discards the old one.
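The shape of that pattern is roughly the following (my paraphrase of what I
saw in the debugger, not the literal duckdb source). Releasing the old chunk
when the next one arrives is only safe if nothing downstream still points
into the old chunk's buffers:
```
#include <arrow/c/abi.h>

// Paraphrased scan loop: keep the current chunk alive until the next one is
// fetched, then release it.
void scan_sketch(struct ArrowArrayStream* stream) {
  struct ArrowArray current = {0};  // release == NULL means "nothing held"
  struct ArrowArray next;
  while (stream->get_next(stream, &next) == 0 && next.release != NULL) {
    if (current.release != NULL) {
      current.release(&current);  // <- where my breakpoint fires; unsafe if
    }                             //    the R side still reads `current`
    current = next;
    // ... hand `current` downstream ...
  }
  if (current.release != NULL) {
    current.release(&current);
  }
}
```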
One possible theory is that DuckDB makes a copy of the array data, in which
case it would be safe to release the array data on the Arrow side once the
copy is made. However, if I compare the buffer pointers that are exported
(from Arrow) against the buffer pointers that are imported (from DuckDB),
many of them match. In a "use after free" scenario it makes sense that not
all buffer pointers are overwritten, but in a "DuckDB makes a copy" scenario
I would expect all of the buffer pointers to be distinct. Furthermore, the
buffers imported from DuckDB do not appear valid: there are more NULLs than I
would expect, and some of the pointers point into memory that is not on the
heap.
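For anyone who wants to reproduce the comparison, it amounts to dumping
buffer addresses on both sides of the bridge. A hypothetical helper (not
something in either codebase) could look like this:
```
#include <arrow/c/abi.h>
#include <inttypes.h>
#include <stdio.h>

// Hypothetical diagnostic: print every buffer address in a batch. Call it on
// the Arrow side right after export and on the DuckDB side right after
// import. Matching addresses mean the memory is shared (so an early release
// is a use-after-free); fully distinct addresses would indicate a copy.
// Note: NULL entries are normal for absent validity buffers.
static void dump_buffer_addrs(const char* tag, const struct ArrowArray* a) {
  for (int64_t i = 0; i < a->n_buffers; i++) {
    printf("%s buffers[%" PRId64 "] = %p\n", tag, i, a->buffers[i]);
  }
  for (int64_t i = 0; i < a->n_children; i++) {
    dump_buffer_addrs(tag, a->children[i]);
  }
}
```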
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]