[jira] [Comment Edited] (ARROW-6910) [Python] pyarrow.parquet.read_table(...) takes up lots of memory which is not released until program exits

Wes McKinney (Jira) Sat, 19 Oct 2019 09:20:55 -0700


    [ 
https://issues.apache.org/jira/browse/ARROW-6910?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16955206#comment-16955206
 ]


Wes McKinney edited comment on ARROW-6910 at 10/19/19 4:19 PM:
---------------------------------------------------------------

I can confirm that setting the "dirty_decay_ms" jemalloc option to 0 causes 
memory to be released to the OS right away. This is likely to reduce 
application performance, but it may make sense to make this the default while 
allowing it to be configured at runtime. I'm working on a patch.

see http://jemalloc.net/jemalloc.3.html
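
As a rough illustration of the runtime configuration discussed above, more recent pyarrow releases expose `pyarrow.jemalloc_set_decay_ms`. The sketch below assumes such a release and a build with jemalloc enabled; the helper name `set_jemalloc_decay` is hypothetical and only wraps the call so that it degrades gracefully when the setting is unavailable.

```python
def set_jemalloc_decay(decay_ms=0):
    """Ask Arrow's jemalloc allocator to return freed pages after decay_ms.

    Hypothetical helper: returns True if the setting was applied, and False
    if pyarrow is missing, too old to expose the call, or built without
    jemalloc.
    """
    try:
        import pyarrow as pa

        # A decay time of 0 ms asks jemalloc to hand dirty pages back to
        # the OS immediately rather than keeping them for reuse.
        pa.jemalloc_set_decay_ms(decay_ms)
    except (ImportError, AttributeError, NotImplementedError):
        return False
    return True


applied = set_jemalloc_decay(0)
print("jemalloc decay set to 0 ms" if applied else "jemalloc setting not available")
```

Calling this once at startup, before any Parquet files are read, should be enough. As noted above, releasing pages eagerly is likely to cost some performance.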


> [Python] pyarrow.parquet.read_table(...) takes up lots of memory which is not 
> released until program exits
> ----------------------------------------------------------------------------------------------------------
>
>                 Key: ARROW-6910
>                 URL: https://issues.apache.org/jira/browse/ARROW-6910
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: C++, Python
>    Affects Versions: 0.15.0
>            Reporter: V Luong
>            Priority: Critical
>             Fix For: 1.0.0, 0.15.1
>
>         Attachments: arrow6910.png
>
>
> I've noticed that when I read a lot of Parquet files using 
> pyarrow.parquet.read_table(...), my program's memory usage becomes very 
> bloated, even though I don't keep the table objects after converting them to 
> Pandas DataFrames.
> You can try this in an interactive Python shell to reproduce this problem:
> ```{python}
> from tqdm import tqdm
> from pyarrow.parquet import read_table
> PATH = '/tmp/big.snappy.parquet'
> for _ in tqdm(range(100)):
>     # Note: the read_table(...) result is not assigned to anything,
>     # so no new objects are kept alive between iterations.
>     read_table(PATH, use_threads=False, memory_map=False)
> ```
> During the for loop above, if you watch the memory usage (e.g. using htop), 
> you'll see that it keeps creeping up. Either the program crashes during the 
> 100 iterations, or, if the 100 iterations complete, the program still 
> occupies a huge amount of memory even though no objects are kept. That 
> memory is only released when you exit() from Python.
> This problem means that my compute jobs using PyArrow currently need to use 
> bigger server instances than I think are necessary, which translates to 
> significant extra cost.
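
The memory growth described above can also be recorded from inside the process instead of watching htop. The following is a minimal sketch using only the standard library; it assumes a Linux-style `/proc` filesystem, and the helper names `current_rss_bytes` and `sample_rss` are hypothetical. The workload is a stand-in allocation, not the reporter's Parquet file.

```python
import os


def current_rss_bytes():
    """Return this process's resident set size in bytes.

    Reads /proc/self/statm (Linux). Returns None where that file is not
    available, for example on macOS or Windows.
    """
    try:
        with open("/proc/self/statm") as f:
            # Second field of statm is the resident set size in pages.
            resident_pages = int(f.read().split()[1])
    except (OSError, IndexError, ValueError):
        return None
    return resident_pages * os.sysconf("SC_PAGE_SIZE")


def sample_rss(work, iterations):
    """Run work() repeatedly and record the RSS after each call."""
    samples = []
    for _ in range(iterations):
        work()  # the result is discarded, as in the report above
        samples.append(current_rss_bytes())
    return samples


# Stand-in workload: allocate and immediately drop about 1 MB.
samples = sample_rss(lambda: bytearray(1_000_000), iterations=3)
print(samples)
```

To reproduce the report, the stand-in lambda could be replaced with `lambda: read_table(PATH, use_threads=False, memory_map=False)`. A steadily rising series of samples would correspond to the creep seen in htop.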



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
