imathews opened a new issue, #38212:
URL: https://github.com/apache/arrow/issues/38212
### Describe the bug, including details regarding any error messages, version, and platform.
My workflow involves streaming data in the IPC streaming format and then
writing the RecordBatches to a file. In both R and Python, I'm observing growing
memory usage as the batches are streamed, suggesting that the RecordBatches
aren't being deallocated after being written to the file (the observed memory
usage maps directly to the size of the batches that have been streamed). Neither
Python nor R reports any significant memory usage, suggesting this is happening
at the C++ layer (manually triggering a `gc()` in R has no effect).
Interestingly, memory is freed immediately upon removing the file from disk
via system calls.
My code is as follows:
```py
# Python
import os

import pyarrow as pa
import requests

# `url` and `schema` are defined elsewhere (not shown)
arrow_response = requests.get(url, stream=True)
record_batch_reader = pa.ipc.RecordBatchStreamReader(arrow_response.raw)

# Memory issues are the same using the native Python file interface
with pa.OSFile("/tmp/outfile", mode="wb") as f:
    record_batch_writer = pa.ipc.RecordBatchFileWriter(f, schema=schema)
    # Memory is observed to increment with each batch written, and is never
    # freed, even after the writer + file are closed
    for batch in record_batch_reader:
        # If this line is removed, we don't see any memory overhead,
        # suggesting the issue is the writer, not the reader
        record_batch_writer.write_batch(batch)
    record_batch_writer.close()

# Removing the file frees memory
# os.remove("/tmp/outfile")
```
```R
# R
output_file <- arrow::FileOutputStream$create(output_file_path)
con <- url("some url", open = "rb", headers = headers)
stream_reader <- arrow::RecordBatchStreamReader$create(
  getNamespace("arrow")$MakeRConnectionInputStream(con)
)
stream_writer <- arrow::RecordBatchFileWriter$create(output_file, schema = schema)
while (TRUE) {
  batch <- stream_reader$read_next_batch()
  if (is.null(batch)) {
    break
  } else {
    stream_writer$write_batch(batch)
  }
}
stream_writer$close()
output_file$close()
```
### Component(s)
Python, R