[ https://issues.apache.org/jira/browse/ARROW-13187?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17370513#comment-17370513 ]

Simon commented on ARROW-13187:
-------------------------------

Thank you for the prompt help. For now, I'll stick with the workaround.
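
For reference, this is roughly how I'm applying the workaround (the helper 
name is my own); a minimal sketch based on the code in the issue description:

{code:python}
import pyarrow as pa
import pyarrow.csv

def read_csv_with_workaround(path):
    table = pa.csv.read_csv(path)
    # ... process the table here ...
    # Reassigning to an empty slice works around the retention
    # (see the reproduction in the issue description below):
    table = table.slice(0, 0)

# Processing many large CSVs sequentially no longer exhausts RAM:
for _ in range(100):
    read_csv_with_workaround("example.csv")
{code}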

As a side note, it seems it might take quite some time before Python 3.8.10 is 
in use by the major distributions (Ubuntu, for example, currently ships 3.8.5).
Given that pyarrow is used for "large objects" as referred to in 
https://bugs.python.org/issue42248 by Gerald Dalley, it might be worthwhile to 
note this somewhere in the pyarrow documentation until Python is upgraded more 
broadly, since reading in many large CSVs in sequence could be considered a 
common task.
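
Such a note could also point out how to confirm the diagnosis: pyarrow's own 
pool statistics can show that Arrow has released the memory even while the 
process RSS stays high, which points at the allocator behavior from the bpo 
issue. A minimal sketch using the public pa.total_allocated_bytes() (the CSV 
file is the one generated by the reproduction below):

{code:python}
import pyarrow as pa
import pyarrow.csv

table = pa.csv.read_csv("example.csv")
print("arrow pool bytes:", pa.total_allocated_bytes())  # large after the read

del table
# If this is (near) zero while the process RSS remains high, Arrow has
# returned its buffers and the retention is on the Python/glibc side,
# matching https://bugs.python.org/issue42248.
print("arrow pool bytes after del:", pa.total_allocated_bytes())
{code}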

> [c++][python] Possibly memory not deallocated when reading in CSV
> -----------------------------------------------------------------
>
>                 Key: ARROW-13187
>                 URL: https://issues.apache.org/jira/browse/ARROW-13187
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: C++, Python
>    Affects Versions: 4.0.1
>            Reporter: Simon
>            Priority: Minor
>         Attachments: backward-refs.png, forward-refs.png
>
>
> When one reads a table in from CSV with pyarrow 4.0.1, it appears that the 
> memory for the read-in table is not freed (or not freed fast enough). I'm 
> unsure whether this is because of pyarrow itself or because of the way 
> pyarrow's memory allocation interacts with Python's memory allocation. I 
> encountered it when processing many large CSVs sequentially.
> When I run the following piece of code, RAM usage increases quite rapidly 
> until the process runs out of memory.
> {code:python}
> import pyarrow as pa
> import pyarrow.csv
> # Generate some CSV file to read in
> print("Generating CSV")
> with open("example.csv", "w+") as f_out:
>     for i in range(0, 10000000):
>         f_out.write("123456789,abc def ghi jkl\n")
>
> def read_in_the_csv():
>     table = pa.csv.read_csv("example.csv")
>     print(table)  # Not strictly necessary to replicate the bug; table can also be an unused variable
>     # This will free up the memory, as a workaround:
>     # table = table.slice(0, 0)
>
> # Read in the CSV many times
> print("Reading in a CSV many times")
> for j in range(100000):
>     read_in_the_csv()
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)