thisisnic commented on issue #46178: URL: https://github.com/apache/arrow/issues/46178#issuecomment-2834316472
Hey, I'm experimenting with using LLMs for debugging, and I'm going to summarise what I found here from inputting some code and docs files into ChatGPT. I'm going to go slow on this as I don't want to waste folks' time looking into inaccurate solutions, so if this is nonsense, I'll stop with this approach in areas of the codebase I'm not familiar with.

> The problem is that in TransferZeroCopy, the Buffers are created directly from memory owned by the Parquet page reader. When the RecordReader or file reader is destroyed, the memory backing those Buffers can disappear - but the Arrow Array still assumes the Buffers stay valid.
>
> Arrow expects Buffers to own or strongly reference their memory. It doesn't track when memory can be deallocated; it trusts that if a Buffer exists, its memory is alive.
>
> Fix:
> When creating Buffers from RecordReader memory, the code should attach a shared_ptr back to the owner (e.g., the page reader or file reader) so that the memory stays alive for as long as the Buffer does.
> Otherwise, downstream operations like Take, filter, or materializing lazy Arrays can cause use-after-free bugs.
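
To make the suggested fix concrete, here is a minimal sketch of the keep-alive idea using only the C++ standard library. `PageReader`, `Buffer`, `WrapPage`, and `BufferOutlivesReaderHandle` are hypothetical stand-ins invented for illustration. They are not Arrow's or Parquet's actual classes, and this does not claim to show how `TransferZeroCopy` is implemented. The only point is that a buffer viewing someone else's memory can hold a `shared_ptr` to that owner, so the memory cannot be freed while the buffer exists.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <memory>
#include <utility>
#include <vector>

// Hypothetical stand-in for a reader that owns decoded page memory.
struct PageReader {
  std::vector<std::uint8_t> page;
  explicit PageReader(std::size_t n) : page(n, 42) {}
};

// Hypothetical stand-in for a buffer: a non-owning view of bytes plus a
// type-erased handle that keeps the real owner of those bytes alive.
class Buffer {
 public:
  Buffer(const std::uint8_t* data, std::size_t size,
         std::shared_ptr<const void> owner)
      : data_(data), size_(size), owner_(std::move(owner)) {}

  const std::uint8_t* data() const { return data_; }
  std::size_t size() const { return size_; }

 private:
  const std::uint8_t* data_;
  std::size_t size_;
  std::shared_ptr<const void> owner_;  // keep-alive reference to the owner
};

// Wrap the reader's memory without copying, and attach the reader itself
// as the owner so the view can never dangle.
std::shared_ptr<Buffer> WrapPage(const std::shared_ptr<PageReader>& reader) {
  return std::make_shared<Buffer>(reader->page.data(), reader->page.size(),
                                  reader);
}

// Demonstrates the lifetime guarantee: dropping the caller's reader handle
// does not free the page while a Buffer still refers to it.
bool BufferOutlivesReaderHandle() {
  auto reader = std::make_shared<PageReader>(4);
  std::weak_ptr<PageReader> watch = reader;  // observes, does not own

  auto buf = WrapPage(reader);
  reader.reset();  // the caller is done with the reader

  // The Buffer's owner reference keeps the reader (and its page) alive.
  bool alive_while_buffer_exists = !watch.expired() && buf->data()[0] == 42;

  buf.reset();  // last reference gone, the reader is destroyed now
  bool freed_after_buffer = watch.expired();

  return alive_while_buffer_exists && freed_after_buffer;
}
```

Without the `owner_` member, `reader.reset()` would destroy the page while `buf` still pointed at it, which is the use-after-free pattern the quoted analysis describes.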