lidavidm commented on PR #13248: URL: https://github.com/apache/arrow/pull/13248#issuecomment-1215899098
> > C Data Interface allocations need to get associated with an actual allocator, even if the allocator isn't actually doing the allocation. > > I guess this is reasonable given where we are, but associating with an allocator that isn't doing the allocation seems likely to confuse future developers if the memory appears to be managed in the Java code but isn't. Is the idea that all users of a given bit of memory keep track of their own use, but only the originator (C++ presumably in this case) actually frees it? @zhztheplayer correct me if I'm wrong, but what I see as the issue is this: data comes in via the C Data Interface. You then want to unload the data into an ArrowRecordBatch and load it into another root (that happens to have its own allocator) to perform some computation on it. But since transfer is not implemented, that currently fails. The problem is that this transfer, _semantically_, is to change the ownership of the buffer. So yes, an allocator would end up with memory it didn't technically allocate. But that's already the case - the whole point of these APIs is to make an allocator responsible for something it didn't allocate! The main implementation problem I see is that currently, the transfer assumes the memory was allocated by an allocator in the same _allocator tree_, i.e. that the allocator implementation is the same. That's the main assumption that would have to change. Perhaps 'allocator' is the wrong way to think about the API if/when we make the change; BufferAllocator is primarily an arena or nursery used for accounting of buffers, that can also allocate new buffers that are associated with itself. In this model, it makes perfect sense to transfer buffers between allocators: it's just a conscious choice to hand over memory ownership. (My memory of the details is fuzzy, but the underlying refcount doesn't really change; it's just changes whether the allocator will complain about a buffer when closed or not.) The Java library works by refcounting a buffer, and freeing it when the reference count reaches 0. From this perspective, memory from the C Data Interface is no different, it's just another kind of buffer to refcount. When the Java-side refcount reaches 0, the C-side release callback is invoked (this is already how it works). > > Ok, thanks. In that case, it seems like we need the ability to associate the C Data reference manager with a Java allocator, and update APIs to require a BufferAllocator to associate with. I don't think that'll require breaking changes in the core APIs, but it would probably mean updating/deprecating APIs in the C Data interface. > > Since the API already has a bufferAllocator argument (that is ignored), I'm not sure what change you're suggesting unless it's simply to annotate the argument as non-nullable. If there was no bufferAllocator argument it would be simple enough to deprecate the current method and add a new one. Yes, this argument would become required. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
