[
https://issues.apache.org/jira/browse/ARROW-13187?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17369785#comment-17369785
]
Weston Pace commented on ARROW-13187:
-------------------------------------
I have tracked down the cause further. I'm not entirely sure what the correct
fix should be, but I think it is a problem in Cython. The issue first occurs
after commit 79ae4f6db3dfe06ba2e1b5c285a6695cfa58cf3d (ARROW-8732: [C++] Add
basic cancellation API).
The method "read_csv" calls "SignalStopHandler()", which calls
"signal.getsignal", which in turn calls "signal.py::_int_to_enum"; that helper
intentionally triggers a ValueError (as is normal in Python).
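For context, this is roughly what that helper looks like in CPython's Lib/signal.py (paraphrased from memory; the exact code varies a bit between Python versions). The except branch below is the "normal" ValueError path mentioned above:
{code:python}
# Rough sketch of CPython's Lib/signal.py helper (may differ by version).
def _int_to_enum(value, enum_klass):
    """Convert a numeric value to an IntEnum member.

    If the value is not a defined member, return it unchanged.
    """
    try:
        return enum_klass(value)
    except ValueError:
        # e.g. getsignal() returning a Python handler function rather than
        # one of the enum values lands here.
        return value
{code}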
That ValueError has an associated traceback which is not disposed of correctly.
The traceback holds a reference to every frame on the stack, and one of those
frames holds a reference to "table". Since a new traceback is generated on each
iteration of the loop, none of the CSV tables are properly disposed of. The
slice trick from the original PR, or an explicit "del table", is a workable
workaround. As long as the frames don't hold much data, the garbage collector
will eventually run and clean them up long before a significant amount of
memory is lost; here, though, each leaked frame pins an entire table.
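To make the mechanism concrete, here is a minimal pure-Python sketch (the names are mine, not from the Arrow code) of how holding onto a traceback keeps a frame, and therefore its locals, alive:
{code:python}
import sys

def allocate_and_fail():
    big_buffer = bytearray(100 * 1024 * 1024)  # stands in for the Arrow table
    raise ValueError("simulated failure")

try:
    allocate_and_fail()
except ValueError:
    # Keeping the traceback keeps every frame it references alive,
    # including the frame whose local variable still points at big_buffer.
    saved_traceback = sys.exc_info()[2]

# big_buffer cannot be reclaimed until the traceback is released, which
# mirrors how the lingering traceback pins "table" inside read_in_the_csv.
del saved_traceback
{code}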
I have no idea why the ValueError/traceback are not being disposed of. I know
Cython has to play some games to manage tracebacks, so it's possible there is
an issue there. I put together what I believe is an equivalent reproduction in
pure Python calling getsignal, and it seems to manage memory correctly, so I
believe Python itself is in the clear.
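For reference, that pure-Python check was along these lines (a sketch, not the exact script): call getsignal directly, so that _int_to_enum raises and swallows its ValueError, then verify via gc that no frame from the calling function lingers:
{code:python}
import gc
import signal

def call_getsignal():
    # getsignal() routes the current SIGINT handler through _int_to_enum,
    # which raises and swallows a ValueError when the handler is a Python
    # function rather than a plain enum value.
    return signal.getsignal(signal.SIGINT)

gc.disable()  # ensure reference counting alone is doing the cleanup
gc.collect()
before = {id(obj) for obj in gc.get_objects()}
call_getsignal()
leaked = [obj for obj in gc.get_objects() if id(obj) not in before]
frames = [obj for obj in leaked
          if 'frame' in str(type(obj)) and 'call_getsignal' in str(obj)]
print(len(frames))  # prints 0 when pure Python cleans up correctly
{code}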
I've created a reproduction script that also uses objgraph to generate
reference graphs. It only runs one iteration, so it is quicker and doesn't
exhaust the RAM on the system. It should print 0 as the last line; if there is
a leak it prints ~270M instead.
{code:python}
import gc
import sys
import pyarrow.parquet
import pyarrow as pa
import pyarrow.csv
import objgraph

# Generate some CSV file to read in
print("Generating CSV")
with open("example.csv", "w+") as f_out:
    for i in range(0, 10000000):
        unused = f_out.write("123456789,abc def ghi jkl\n")

def read_in_the_csv():
    table = pa.csv.read_csv("example.csv")
    print(pa.total_allocated_bytes())

# Snapshot live objects before and after a single read, with the GC disabled
# so reference counting alone has to clean up.
gc.disable()
gc.collect()
objs = gc.get_objects()
read_in_the_csv()
objs2 = gc.get_objects()
offensive_ids = set([id(obj) for obj in objs2]) - set([id(obj) for obj in objs])
badobjs = [obj for obj in objs2 if id(obj) in offensive_ids]
print(len(badobjs))

# Narrow the leaked objects down to frames from read_in_the_csv and graph them.
smallbadobjs = [obj for obj in badobjs
                if 'frame' in str(type(obj)) and 'read_in_the_csv' in str(obj)]
objgraph.show_refs(smallbadobjs, refcounts=True)
objgraph.show_backrefs(smallbadobjs, refcounts=True)
print(pa.total_allocated_bytes())
{code}
So at this point I surrender and ask [~apitrou] [~jorisvandenbossche] or
[~amol-] for help :)
*Forward refs show a frame in the traceback still references the Table:*
!forward-refs.png!
*Backward refs show the frame is referenced as part of a traceback (note: this
graph is truncated and does not show the source ValueError; the dict and two
lists come from my debugging code and are unrelated to the issue):*
!backward-refs.png!
> [c++][python] Possibly memory not deallocated when reading in CSV
> -----------------------------------------------------------------
>
> Key: ARROW-13187
> URL: https://issues.apache.org/jira/browse/ARROW-13187
> Project: Apache Arrow
> Issue Type: Bug
> Components: C++, Python
> Affects Versions: 4.0.1
> Reporter: Simon
> Priority: Minor
> Attachments: backward-refs.png, forward-refs.png
>
>
> When one reads in a table from CSV in pyarrow version 4.0.1, it appears that
> the read-in table variable is not freed (or not fast enough). I'm unsure if
> this is because of pyarrow or because of the way pyarrow memory allocation
> interacts with Python memory allocation. I encountered it when processing
> many large CSVs sequentially.
> When I run the following piece of code, RAM usage increases quite rapidly
> until the system runs out of memory.
> {code:python}
> import pyarrow as pa
> import pyarrow.csv
>
> # Generate some CSV file to read in
> print("Generating CSV")
> with open("example.csv", "w+") as f_out:
>     for i in range(0, 10000000):
>         f_out.write("123456789,abc def ghi jkl\n")
>
> def read_in_the_csv():
>     table = pa.csv.read_csv("example.csv")
>     print(table)  # Not strictly necessary to replicate the bug; table can also be an unused variable
>     # This will free up the memory, as a workaround:
>     # table = table.slice(0, 0)
>
> # Read in the CSV many times
> print("Reading in a CSV many times")
> for j in range(100000):
>     read_in_the_csv()
> {code}