imathews opened a new issue, #38212:
URL: https://github.com/apache/arrow/issues/38212

   ### Describe the bug, including details regarding any error messages, version, and platform.
   
   My workflow involves streaming data in the IPC streaming format and then writing the RecordBatches to a file. In both R and Python, I'm observing memory usage that grows as the batches are streamed, suggesting that the RecordBatches aren't being deallocated after being written to the file (the observed memory usage maps directly to the total size of the batches that have been streamed). Neither Python nor R reports any significant memory usage of its own, which suggests this is happening at the C++ layer (manually triggering a `gc()` in R has no effect).
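   To separate what Arrow's C++ allocator is holding from what the OS attributes to the process, a check along the following lines can be run alongside the loop (a minimal diagnostic sketch, not part of the scripts below; note that `ru_maxrss` units differ by platform):
   
   ```py
   import resource
   
   import pyarrow as pa
   
   def report_memory(label=""):
       # ru_maxrss is reported in kilobytes on Linux and in bytes on macOS
       rss = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
       print(
           f"{label} rss={rss} "
           f"arrow_allocated={pa.total_allocated_bytes()} "
           f"pool_max={pa.default_memory_pool().max_memory()}"
       )
   ```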
   
   Interestingly, the memory is freed immediately when the file is removed from disk via a system call.
   
   My code is as follows:
   ```py
   # Python
   import requests
   import pyarrow as pa
   
   arrow_response = requests.get(url, stream=True)
   record_batch_reader = pa.ipc.RecordBatchStreamReader(arrow_response.raw)
   schema = record_batch_reader.schema
   
   # Memory issues are the same using the native Python file interface
   with pa.OSFile("/tmp/outfile", mode="wb") as f:
       record_batch_writer = pa.ipc.RecordBatchFileWriter(f, schema=schema)
   
       # Memory is observed to increment with each batch written, and is never
       # freed, even after the writer + file are closed
       for batch in record_batch_reader:
           # If the next line is removed, we don't see any memory overhead,
           # suggesting the issue is the writer, not the reader
           record_batch_writer.write_batch(batch)
   
       record_batch_writer.close()
   
   
   # Removing the file frees memory
   # os.remove("/tmp/outfile")
   ```
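   
   The same write loop can also be exercised without the network dependency; below is a minimal, self-contained sketch for reproducing the measurement (the schema, batch size, row values, and output path are illustrative placeholders rather than the actual workload):
   
   ```py
   import pyarrow as pa
   
   schema = pa.schema([("x", pa.int64()), ("y", pa.float64())])
   
   def make_batch(n=100_000):
       # Build a synthetic batch standing in for one streamed batch
       xs = pa.array(list(range(n)), type=pa.int64())
       ys = pa.array([float(i) for i in range(n)], type=pa.float64())
       return pa.RecordBatch.from_arrays([xs, ys], schema=schema)
   
   with pa.OSFile("/tmp/outfile_repro", mode="wb") as f:
       writer = pa.ipc.RecordBatchFileWriter(f, schema=schema)
       for i in range(20):
           writer.write_batch(make_batch())
           # Report Arrow's own allocator after each write so any growth
           # can be attributed (or not) to the C++ memory pool
           print(i, "arrow_allocated:", pa.total_allocated_bytes())
       writer.close()
   ```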
   
   ```R
   # R
   
   output_file <- arrow::FileOutputStream$create(output_file_path)
   
   con <- url("some url", open = "rb", headers = headers)
   stream_reader <- arrow::RecordBatchStreamReader$create(
       getNamespace("arrow")$MakeRConnectionInputStream(con)
   )
   stream_writer <- arrow::RecordBatchFileWriter$create(
       output_file,
       schema = stream_reader$schema
   )
   
   while (TRUE) {
       batch <- stream_reader$read_next_batch()
       if (is.null(batch)) {
           break
       } else {
           stream_writer$write_batch(batch)
       }
   }
   
   stream_writer$close()
   output_file$close()
   ```
   
   
   ### Component(s)
   
   Python, R

