Weston Pace created ARROW-15017:
-----------------------------------
Summary: [Python][C++] pyarrow.ipc.RecordBatchFileReader holding
onto memory after being disposed
Key: ARROW-15017
URL: https://issues.apache.org/jira/browse/ARROW-15017
Project: Apache Arrow
Issue Type: Bug
Components: C++, Python
Reporter: Weston Pace
I'll attach a full reproduction but the important bit is here:
{noformat}
import gc
from pyarrow import ipc

with ipc.RecordBatchFileReader(path) as reader:
    table = reader.read_all()
# If you comment out this next line then memory usage will be worse
del reader
df = table.to_pandas()
del table
gc.collect()
{noformat}
The input file is ~3GB. Peak memory usage is ~6GB because of the conversion to
pandas, and over time the excess 3GB is returned by jemalloc.
However, if I do not run "del reader", peak usage is ~9GB and it only shrinks
down to 6GB, even after a 5 second wait. Since the reader is disposed, this was
rather surprising to me.
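
As a general Python illustration of the distinction above (not a diagnosis of this bug, and not pyarrow code), the sketch below shows that leaving a "with" block only calls __exit__; the name bound by "as" stays alive until it is deleted or goes out of scope. ToyReader, Buffer, and buffer_alive_after are hypothetical names invented for this example, and the assumption that closing does not drop the buffer reference is built into the toy class on purpose.

```python
import gc
import weakref


class Buffer:
    """Stand-in for a large allocation (hypothetical, not a pyarrow type)."""


class ToyReader:
    """Hypothetical reader whose close step keeps its buffer reference."""

    def __init__(self):
        self.buffer = Buffer()
        self.closed = False

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb):
        # Marks the reader as closed but deliberately keeps self.buffer.
        self.closed = True
        return False


def buffer_alive_after(delete_reader):
    """Return True if the buffer is still reachable after the with block."""
    with ToyReader() as reader:
        probe = weakref.ref(reader.buffer)
    # The with block has ended, but the name `reader` is still bound here.
    if delete_reader:
        del reader
    gc.collect()
    return probe() is not None


if __name__ == "__main__":
    print("without del reader:", buffer_alive_after(False))
    print("with del reader:   ", buffer_alive_after(True))
```

In this toy setup the buffer survives the "with" block unless the reader name is deleted. Whether the real reader behaves this way is exactly what this issue asks about.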