[
https://issues.apache.org/jira/browse/ARROW-11007?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17280160#comment-17280160
]
Dmitry Kashtanov commented on ARROW-11007:
------------------------------------------
It looks like it's a zero-copy operation: after the `pyarrow.Table` is created
and before the `pandas.DataFrame` is created, pyarrow reports zero bytes
allocated (on both Linux and macOS):
{code:java}
Before pandas dataframe creation
PyArrow mem pool info: jemalloc backend, 0 allocated, 0 max allocated,
PyArrow total allocated bytes: 0
{code}
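For reference, these numbers can be read directly from pyarrow's default memory pool; a minimal sketch (the exact print format above is an assumption, not pyarrow's own output):
{code:python}
import pyarrow as pa

pool = pa.default_memory_pool()
# backend_name is 'jemalloc', 'mimalloc', or 'system' depending on the build
print(f"PyArrow mem pool info: {pool.backend_name} backend, "
      f"{pool.bytes_allocated()} allocated, {pool.max_memory()} max allocated")
print(f"PyArrow total allocated bytes: {pa.total_allocated_bytes()}")
{code}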
So with this, it looks like we have the following container sequence (see the sketch after this list):
# a list of `pyarrow.RecordBatch`es backed by memory allocated by `google.protobuf`
# a `pyarrow.Table` backed by (most likely exactly the same) memory allocated by `google.protobuf`
# then a `pandas.DataFrame` backed by memory allocated by `pyarrow`
# then, after a column drop, a `pandas.DataFrame` backed by memory allocated by `pandas`/`numpy`
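A minimal sketch of that sequence, with pool stats printed at each step. It assumes locally built batches rather than protobuf-backed ones, so step 1 *does* allocate from pyarrow's pool here, unlike the case above:
{code:python}
import pyarrow as pa

# 1. A list of RecordBatches. (In the case above these wrap protobuf-owned
#    buffers; here they are built locally, so pyarrow does the allocating.)
batches = [
    pa.RecordBatch.from_arrays(
        [pa.array(list(range(100000))), pa.array([2.0] * 100000)],
        names=["int", "float"],
    )
    for _ in range(10)
]
print("after batches:   ", pa.total_allocated_bytes())

# 2. Table creation only references the batches' existing buffers
#    (zero-copy), so allocated bytes should stay flat here.
table = pa.Table.from_batches(batches)
print("after Table:     ", pa.total_allocated_bytes())

# 3. to_pandas() materializes the DataFrame's blocks; per step 3 above,
#    that memory is allocated by pyarrow.
df = table.to_pandas()
print("after to_pandas: ", pa.total_allocated_bytes())

# 4. A column drop copies the remaining data into pandas/numpy-owned
#    memory, after which the pyarrow-backed buffers can be released.
df = df.drop(columns=["float"])
{code}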
So my current assumption is that `google.protobuf` uses a different memory
allocator on Linux than on macOS. The Linux one may be Google's TCMalloc
(which [is Linux
only|https://github.com/google/tcmalloc/blob/master/docs/platforms.md]).
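From the Python side this only exercises pyarrow's own pool, not `google.protobuf`'s allocator, but switching backends is one way to check whether allocator retention, rather than a true leak, explains what the OS reports; a sketch:
{code:python}
import pyarrow as pa

# Which allocator backend pyarrow itself is using
# ('jemalloc', 'mimalloc', or 'system'):
print(pa.default_memory_pool().backend_name)

# Ask jemalloc to return freed pages to the OS immediately instead of
# decaying them over time (raises if pyarrow was built without jemalloc):
pa.jemalloc_set_decay_ms(0)

# Or bypass jemalloc entirely and use the OS allocator:
pa.set_memory_pool(pa.system_memory_pool())
{code}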
> [Python] Memory leak in pq.read_table and table.to_pandas
> ---------------------------------------------------------
>
> Key: ARROW-11007
> URL: https://issues.apache.org/jira/browse/ARROW-11007
> Project: Apache Arrow
> Issue Type: Bug
> Components: Python
> Affects Versions: 2.0.0
> Reporter: Michael Peleshenko
> Priority: Major
>
> While upgrading our application to use pyarrow 2.0.0 instead of 0.12.1, we
> observed a memory leak in the read_table and to_pandas methods. See below for
> sample code to reproduce it. Memory does not seem to be returned after
> deleting the table and df as it was in pyarrow 0.12.1.
> *Sample Code*
> {code:python}
> import io
>
> import pandas as pd
> import pyarrow as pa
> import pyarrow.parquet as pq
> from memory_profiler import profile
>
>
> @profile
> def read_file(f):
>     table = pq.read_table(f)
>     df = table.to_pandas(strings_to_categorical=True)
>     del table
>     del df
>
>
> def main():
>     rows = 2000000
>     df = pd.DataFrame({
>         "string": ["test"] * rows,
>         "int": [5] * rows,
>         "float": [2.0] * rows,
>     })
>     table = pa.Table.from_pandas(df, preserve_index=False)
>     parquet_stream = io.BytesIO()
>     pq.write_table(table, parquet_stream)
>     for i in range(3):
>         parquet_stream.seek(0)
>         read_file(parquet_stream)
>
>
> if __name__ == '__main__':
>     main()
> {code}
> *Python 3.8.5 (conda), pyarrow 2.0.0 (pip), pandas 1.1.2 (pip) Logs*
> {code:java}
> Filename: C:/run_pyarrow_memoy_leak_sample.py
> Line # Mem usage Increment Occurences Line Contents
> ============================================================
> 9 161.7 MiB 161.7 MiB 1 @profile
> 10 def read_file(f):
> 11 212.1 MiB 50.4 MiB 1 table = pq.read_table(f)
> 12 258.2 MiB 46.1 MiB 1 df = table.to_pandas(strings_to_categorical=True)
> 13 258.2 MiB 0.0 MiB 1 del table
> 14 256.3 MiB -1.9 MiB 1 del df
> Filename: C:/run_pyarrow_memoy_leak_sample.py
> Line # Mem usage Increment Occurences Line Contents
> ============================================================
> 9 256.3 MiB 256.3 MiB 1 @profile
> 10 def read_file(f):
> 11 279.2 MiB 23.0 MiB 1 table = pq.read_table(f)
> 12 322.2 MiB 43.0 MiB 1 df = table.to_pandas(strings_to_categorical=True)
> 13 322.2 MiB 0.0 MiB 1 del table
> 14 320.3 MiB -1.9 MiB 1 del df
> Filename: C:/run_pyarrow_memoy_leak_sample.py
> Line # Mem usage Increment Occurences Line Contents
> ============================================================
> 9 320.3 MiB 320.3 MiB 1 @profile
> 10 def read_file(f):
> 11 326.9 MiB 6.5 MiB 1 table = pq.read_table(f)
> 12 361.7 MiB 34.8 MiB 1 df = table.to_pandas(strings_to_categorical=True)
> 13 361.7 MiB 0.0 MiB 1 del table
> 14 359.8 MiB -1.9 MiB 1 del df
> {code}
> *Python 3.5.6 (conda), pyarrow 0.12.1 (pip), pandas 0.24.1 (pip) Logs*
> {code:java}
> Filename: C:/run_pyarrow_memoy_leak_sample.py
> Line # Mem usage Increment Occurences Line Contents
> ============================================================
> 9 138.4 MiB 138.4 MiB 1 @profile
> 10 def read_file(f):
> 11 186.2 MiB 47.8 MiB 1 table = pq.read_table(f)
> 12 219.2 MiB 33.0 MiB 1 df = table.to_pandas(strings_to_categorical=True)
> 13 171.7 MiB -47.5 MiB 1 del table
> 14 139.3 MiB -32.4 MiB 1 del df
> Filename: C:/run_pyarrow_memoy_leak_sample.py
> Line # Mem usage Increment Occurences Line Contents
> ============================================================
> 9 139.3 MiB 139.3 MiB 1 @profile
> 10 def read_file(f):
> 11 186.8 MiB 47.5 MiB 1 table = pq.read_table(f)
> 12 219.2 MiB 32.4 MiB 1 df = table.to_pandas(strings_to_categorical=True)
> 13 171.5 MiB -47.7 MiB 1 del table
> 14 139.1 MiB -32.4 MiB 1 del df
> Filename: C:/run_pyarrow_memoy_leak_sample.py
> Line # Mem usage Increment Occurences Line Contents
> ============================================================
> 9 139.1 MiB 139.1 MiB 1 @profile
> 10 def read_file(f):
> 11 186.8 MiB 47.7 MiB 1 table = pq.read_table(f)
> 12 219.2 MiB 32.4 MiB 1 df = table.to_pandas(strings_to_categorical=True)
> 13 171.8 MiB -47.5 MiB 1 del table
> 14 139.3 MiB -32.4 MiB 1 del df
> {code}
--
This message was sent by Atlassian Jira
(v8.3.4#803005)