[
https://issues.apache.org/jira/browse/ARROW-13187?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17369842#comment-17369842
]
Antoine Pitrou commented on ARROW-13187:
----------------------------------------
This seems to be, well, a classic cyclic-reference issue caused by a traceback.
It's trivially reproducible at a Python prompt:
{code:python}
>>> import signal, gc, weakref
>>> gc.disable()  # disable automatic cyclic reference collection
>>> class C: pass
...
>>> def h(*args): pass
...
>>> signal.signal(signal.SIGINT, h)
<built-in function default_int_handler>
>>>
>>> def f():
...     global wr
...     c = C()
...     wr = weakref.ref(c)
...     signal.getsignal(signal.SIGINT)
...
>>> f()
>>> wr() is None
False  # object `c` is still alive
>>> gc.collect()  # collect cyclic references
17
>>> wr() is None
True  # object `c` has been collected
{code}
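If the same mechanism is at work in the report, the tables returned by read_csv should stay alive only until the cyclic collector runs. A minimal sketch of how one might check that, assuming a scaled-down stand-in for the reporter's example.csv and using pa.total_allocated_bytes() to observe the default memory pool:
{code:python}
import gc

import pyarrow as pa
import pyarrow.csv

# Hypothetical scaled-down stand-in for the reporter's example.csv.
with open("example.csv", "w") as f_out:
    for i in range(100000):
        f_out.write("123456789,abc def ghi jkl\n")

def read_in_the_csv():
    # The table becomes unreachable when the function returns, but a
    # reference cycle (e.g. through a traceback) would keep it alive
    # until the cyclic collector runs.
    table = pa.csv.read_csv("example.csv")

read_in_the_csv()
print(pa.total_allocated_bytes())  # stays high if the table is still alive
gc.collect()                       # force a cyclic garbage collection pass
print(pa.total_allocated_bytes())  # should drop once the cycle is collected
{code}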
> [c++][python] Possibly memory not deallocated when reading in CSV
> -----------------------------------------------------------------
>
> Key: ARROW-13187
> URL: https://issues.apache.org/jira/browse/ARROW-13187
> Project: Apache Arrow
> Issue Type: Bug
> Components: C++, Python
> Affects Versions: 4.0.1
> Reporter: Simon
> Priority: Minor
> Attachments: backward-refs.png, forward-refs.png
>
>
> When one reads a table from CSV with pyarrow version 4.0.1, it appears that
> the read-in table is not freed (or not freed fast enough). I'm unsure whether
> this is caused by pyarrow itself or by the way pyarrow's memory allocation
> interacts with Python's memory allocation. I encountered it while processing
> many large CSVs sequentially.
> When I run the following piece of code, RAM usage increases quite rapidly
> until the process runs out of memory.
> {code:python}
> import pyarrow as pa
> import pyarrow.csv
>
> # Generate some CSV file to read in
> print("Generating CSV")
> with open("example.csv", "w+") as f_out:
>     for i in range(0, 10000000):
>         f_out.write("123456789,abc def ghi jkl\n")
>
> def read_in_the_csv():
>     table = pa.csv.read_csv("example.csv")
>     print(table)  # Not strictly necessary to replicate the bug; table can
>                   # also be an unused variable
>     # This will free up the memory, as a workaround:
>     # table = table.slice(0, 0)
>
> # Read in the CSV many times
> print("Reading in a CSV many times")
> for j in range(100000):
>     read_in_the_csv()
> {code}