[
https://issues.apache.org/jira/browse/ARROW-13187?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17369785#comment-17369785
]
Weston Pace commented on ARROW-13187:
-------------------------------------
I have tracked down the cause further. I'm not entirely sure what the correct
fix should be, but I think it is a problem in Cython. The issue first occurs
after commit 79ae4f6db3dfe06ba2e1b5c285a6695cfa58cf3d (ARROW-8732: [C++] Add
basic cancellation API).
The method "read_csv" calls "SignalStopHandler()", which calls
"signal.getsignal", which in turn calls "signal.py::_int_to_enum"; that helper
intentionally triggers a ValueError (as is normal in Python).
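For context, this is roughly what that helper looks like in CPython's Lib/signal.py (paraphrased from memory; the exact code varies a bit between Python versions). The except branch below is the "normal" ValueError path mentioned above:
{code:python}
# Rough sketch of CPython's Lib/signal.py helper (may differ by version).
def _int_to_enum(value, enum_klass):
    """Convert a numeric value to an IntEnum member.

    If the value is not a defined member, return it unchanged.
    """
    try:
        return enum_klass(value)
    except ValueError:
        # e.g. getsignal() returning a Python handler function rather than
        # one of the enum values lands here.
        return value
{code}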
That ValueError has an associated traceback which is not disposed of correctly.
The traceback holds a reference to every frame on the stack, and one of those
frames holds a reference to "table". Since a new traceback is generated on each
iteration of the loop, none of the CSV tables are properly disposed of. The
slice trick from the original PR, or an explicit "del table", is a workable
workaround. As long as the frames don't hold much data, the garbage collector
will eventually run and clean them up long before a significant amount of
memory is lost; here, though, each leaked frame pins an entire table.
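To make the mechanism concrete, here is a minimal pure-Python sketch (the names are mine, not from the Arrow code) of how holding onto a traceback keeps a frame, and therefore its locals, alive:
{code:python}
import sys

def allocate_and_fail():
    big_buffer = bytearray(100 * 1024 * 1024)  # stands in for the Arrow table
    raise ValueError("simulated failure")

try:
    allocate_and_fail()
except ValueError:
    # Keeping the traceback keeps every frame it references alive,
    # including the frame whose local variable still points at big_buffer.
    saved_traceback = sys.exc_info()[2]

# big_buffer cannot be reclaimed until the traceback is released, which
# mirrors how the lingering traceback pins "table" inside read_in_the_csv.
del saved_traceback
{code}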
I have no idea why the ValueError/traceback are not being disposed of. I know
Cython has to play some games to manage tracebacks, so it's possible there is
an issue there. I put together what I believe is an equivalent reproduction in
pure Python calling getsignal, and it seems to manage memory correctly, so I
believe Python itself is in the clear.
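For reference, that pure-Python check was along these lines (a sketch, not the exact script): call getsignal directly, so that _int_to_enum raises and swallows its ValueError, then verify via gc that no frame from the calling function lingers:
{code:python}
import gc
import signal

def call_getsignal():
    # getsignal() routes the current SIGINT handler through _int_to_enum,
    # which raises and swallows a ValueError when the handler is a Python
    # function rather than a plain enum value.
    return signal.getsignal(signal.SIGINT)

gc.disable()  # ensure reference counting alone is doing the cleanup
gc.collect()
before = {id(obj) for obj in gc.get_objects()}
call_getsignal()
leaked = [obj for obj in gc.get_objects() if id(obj) not in before]
frames = [obj for obj in leaked
          if 'frame' in str(type(obj)) and 'call_getsignal' in str(obj)]
print(len(frames))  # prints 0 when pure Python cleans up correctly
{code}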
I've created a reproduction script that also uses objgraph to generate
reference graphs. It only runs one iteration, so it is quicker and doesn't
exhaust the RAM on the system. It should print 0 as the last line; if there is
a leak it prints ~270M instead.
{code:python}
import gc
import sys
import pyarrow.parquet
import pyarrow as pa
import pyarrow.csv
import objgraph

# Generate some CSV file to read in
print("Generating CSV")
with open("example.csv", "w+") as f_out:
    for i in range(0, 10000000):
        unused = f_out.write("123456789,abc def ghi jkl\n")

def read_in_the_csv():
    table = pa.csv.read_csv("example.csv")
    print(pa.total_allocated_bytes())

# Snapshot live objects before and after a single read, with the GC disabled
# so reference counting alone has to clean up.
gc.disable()
gc.collect()
objs = gc.get_objects()
read_in_the_csv()
objs2 = gc.get_objects()
offensive_ids = set([id(obj) for obj in objs2]) - set([id(obj) for obj in objs])
badobjs = [obj for obj in objs2 if id(obj) in offensive_ids]
print(len(badobjs))

# Narrow the leaked objects down to frames from read_in_the_csv and graph them.
smallbadobjs = [obj for obj in badobjs
                if 'frame' in str(type(obj)) and 'read_in_the_csv' in str(obj)]
objgraph.show_refs(smallbadobjs, refcounts=True)
objgraph.show_backrefs(smallbadobjs, refcounts=True)
print(pa.total_allocated_bytes())
{code}
So at this point I surrender and ask [~apitrou] [~jorisvandenbossche] or
[~amol-] for help :)
*Forward refs show a frame in the traceback still references the Table:*
!forward-refs.png!
*Backward refs show the frame is referenced as part of a traceback (note: this
graph is truncated and does not show the source ValueError; the dict and two
lists come from my debugging code and are unrelated to the issue):*
!backward-refs.png!
> [c++][python] Possibly memory not deallocated when reading in CSV
> -----------------------------------------------------------------
>
> Key: ARROW-13187
> URL: https://issues.apache.org/jira/browse/ARROW-13187
> Project: Apache Arrow
> Issue Type: Bug
> Components: C++, Python
> Affects Versions: 4.0.1
> Reporter: Simon
> Priority: Minor
> Attachments: backward-refs.png, forward-refs.png
>
>
> When one reads in a table from CSV in pyarrow version 4.0.1, it appears that
> the read-in table variable is not freed (or not fast enough). I'm unsure if
> this is because of pyarrow or because of the way pyarrow memory allocation
> interacts with Python memory allocation. I encountered it when processing
> many large CSVs sequentially.
> When I run the following piece of code, RAM usage increases quite rapidly
> until the system runs out of memory.
> {code:python}
> import pyarrow as pa
> import pyarrow.csv
>
> # Generate some CSV file to read in
> print("Generating CSV")
> with open("example.csv", "w+") as f_out:
>     for i in range(0, 10000000):
>         f_out.write("123456789,abc def ghi jkl\n")
>
> def read_in_the_csv():
>     table = pa.csv.read_csv("example.csv")
>     print(table)  # Not strictly necessary to replicate the bug; table can also be an unused variable
>     # This will free up the memory, as a workaround:
>     # table = table.slice(0, 0)
>
> # Read in the CSV many times
> print("Reading in a CSV many times")
> for j in range(100000):
>     read_in_the_csv()
> {code}