[jira] [Commented] (ARROW-11007) [Python] Memory leak in pq.read_table and table.to_pandas

Joris Van den Bossche (Jira) Mon, 04 Jan 2021 04:06:06 -0800


    [ 
https://issues.apache.org/jira/browse/ARROW-11007?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17258165#comment-17258165
 ]


Joris Van den Bossche commented on ARROW-11007:
-----------------------------------------------

Given that this comes up from time to time, it might be useful to document this 
to some extent: expectations around watching memory usage (explaining that the 
deallocated memory might be cached by the memory allocator etc), how you can 
actually see how much memory is used (total_allocated_bytes), how you can check 
if high "apparent" memory usage is indeed related to this and not caused by a 
memory leak (use pa.jemalloc_set_decay_ms(0); run your function many times in a 
loop and see it stabilizes or keeps constantly growing), ..

> [Python] Memory leak in pq.read_table and table.to_pandas
> ---------------------------------------------------------
>
>                 Key: ARROW-11007
>                 URL: https://issues.apache.org/jira/browse/ARROW-11007
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>    Affects Versions: 2.0.0
>            Reporter: Michael Peleshenko
>            Priority: Major
>
> While upgrading our application to use pyarrow 2.0.0 instead of 0.12.1, we 
> observed a memory leak in the read_table and to_pandas methods. See below for 
> sample code to reproduce it. Memory does not seem to be returned after 
> deleting the table and df as it was in pyarrow 0.12.1.
> *Sample Code*
> {code:python}
> import io
> import pandas as pd
> import pyarrow as pa
> import pyarrow.parquet as pq
> from memory_profiler import profile
> @profile
> def read_file(f):
>     table = pq.read_table(f)
>     df = table.to_pandas(strings_to_categorical=True)
>     del table
>     del df
> def main():
>     rows = 2000000
>     df = pd.DataFrame({
>         "string": ["test"] * rows,
>         "int": [5] * rows,
>         "float": [2.0] * rows,
>     })
>     table = pa.Table.from_pandas(df, preserve_index=False)
>     parquet_stream = io.BytesIO()
>     pq.write_table(table, parquet_stream)
>     for i in range(3):
>         parquet_stream.seek(0)
>         read_file(parquet_stream)
> if __name__ == '__main__':
>     main()
> {code}
> *Python 3.8.5 (conda), pyarrow 2.0.0 (pip), pandas 1.1.2 (pip) Logs*
> {code:java}
> Filename: C:/run_pyarrow_memoy_leak_sample.py
> Line #    Mem usage    Increment  Occurences   Line Contents
> ============================================================
>      9    161.7 MiB    161.7 MiB           1   @profile
>     10                                         def read_file(f):
>     11    212.1 MiB     50.4 MiB           1       table = pq.read_table(f)
>     12    258.2 MiB     46.1 MiB           1       df = 
> table.to_pandas(strings_to_categorical=True)
>     13    258.2 MiB      0.0 MiB           1       del table
>     14    256.3 MiB     -1.9 MiB           1       del df
> Filename: C:/run_pyarrow_memoy_leak_sample.py
> Line #    Mem usage    Increment  Occurences   Line Contents
> ============================================================
>      9    256.3 MiB    256.3 MiB           1   @profile
>     10                                         def read_file(f):
>     11    279.2 MiB     23.0 MiB           1       table = pq.read_table(f)
>     12    322.2 MiB     43.0 MiB           1       df = 
> table.to_pandas(strings_to_categorical=True)
>     13    322.2 MiB      0.0 MiB           1       del table
>     14    320.3 MiB     -1.9 MiB           1       del df
> Filename: C:/run_pyarrow_memoy_leak_sample.py
> Line #    Mem usage    Increment  Occurences   Line Contents
> ============================================================
>      9    320.3 MiB    320.3 MiB           1   @profile
>     10                                         def read_file(f):
>     11    326.9 MiB      6.5 MiB           1       table = pq.read_table(f)
>     12    361.7 MiB     34.8 MiB           1       df = 
> table.to_pandas(strings_to_categorical=True)
>     13    361.7 MiB      0.0 MiB           1       del table
>     14    359.8 MiB     -1.9 MiB           1       del df
> {code}
> *Python 3.5.6 (conda), pyarrow 0.12.1 (pip), pandas 0.24.1 (pip) Logs*
> {code:java}
> Filename: C:/run_pyarrow_memoy_leak_sample.py
> Line #    Mem usage    Increment  Occurences   Line Contents
> ============================================================
>      9    138.4 MiB    138.4 MiB           1   @profile
>     10                                         def read_file(f):
>     11    186.2 MiB     47.8 MiB           1       table = pq.read_table(f)
>     12    219.2 MiB     33.0 MiB           1       df = 
> table.to_pandas(strings_to_categorical=True)
>     13    171.7 MiB    -47.5 MiB           1       del table
>     14    139.3 MiB    -32.4 MiB           1       del df
> Filename: C:/run_pyarrow_memoy_leak_sample.py
> Line #    Mem usage    Increment  Occurences   Line Contents
> ============================================================
>      9    139.3 MiB    139.3 MiB           1   @profile
>     10                                         def read_file(f):
>     11    186.8 MiB     47.5 MiB           1       table = pq.read_table(f)
>     12    219.2 MiB     32.4 MiB           1       df = 
> table.to_pandas(strings_to_categorical=True)
>     13    171.5 MiB    -47.7 MiB           1       del table
>     14    139.1 MiB    -32.4 MiB           1       del df
> Filename: C:/run_pyarrow_memoy_leak_sample.py
> Line #    Mem usage    Increment  Occurences   Line Contents
> ============================================================
>      9    139.1 MiB    139.1 MiB           1   @profile
>     10                                         def read_file(f):
>     11    186.8 MiB     47.7 MiB           1       table = pq.read_table(f)
>     12    219.2 MiB     32.4 MiB           1       df = 
> table.to_pandas(strings_to_categorical=True)
>     13    171.8 MiB    -47.5 MiB           1       del table
>     14    139.3 MiB    -32.4 MiB           1       del df
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

[jira] [Commented] (ARROW-11007) [Python] Memory leak in pq.read_table and table.to_pandas

Reply via email to