[ https://issues.apache.org/jira/browse/ARROW-5302?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jorge updated ARROW-5302:
-------------------------
    Description: 
The following piece of code (running on Linux, Python 3.6 from Anaconda)
demonstrates a memory leak when reading data from disk.
{code:python}
import resource

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq


# build a single-column DataFrame whose cells are lists, i.e. an array column
path = 'data.parquet'
batches = 5000
df = pd.DataFrame({
    't': [list(range(0, 180 * 60, 5))] * batches,
})

pq.write_table(pa.Table.from_pandas(df), path)

table = pq.read_table(path)


# read the data back and convert it to JSON (e.g. the backend of a RESTful API)
for i in range(100):
    # comment out either of the next two lines and the leak vanishes
    df = pq.read_table(path).to_pandas()
    df['t'].to_json()
    print(resource.getrusage(resource.RUSAGE_SELF).ru_maxrss)

{code}
Result (ru_maxrss is reported in kilobytes on Linux):
{code}
481676
618584
755396
892156
1028892
1165660
1302428
1439184
1620376
1801340
...{code}
That is roughly 130–180 MB of growth per iteration. Relevant pip freeze:

pyarrow (0.13.0)
pandas (0.24.2)

Note: it is not entirely obvious that the leak is in pyarrow rather than
pandas or numpy; I was only able to reproduce it when the data round-trips
through pyarrow's Parquet write/read.
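To help narrow down which layer retains the memory, here is a minimal diagnostic sketch (not part of the original report; it assumes the {{data.parquet}} file written by the script above). It forces a garbage-collection pass each iteration to rule out uncollected reference cycles, and prints pyarrow's allocator counter, {{pa.total_allocated_bytes()}}, next to the process peak RSS. If the Arrow counter stays flat while ru_maxrss keeps climbing, the retained memory lives outside Arrow's default memory pool, which would point at pandas/numpy or the JSON conversion path instead.
{code:python}
import gc
import resource

import pyarrow as pa
import pyarrow.parquet as pq

path = 'data.parquet'  # file produced by the reproduction script above

for i in range(100):
    df = pq.read_table(path).to_pandas()
    df['t'].to_json()
    del df
    gc.collect()  # rule out memory held alive by uncollected reference cycles
    print('rss_kb=%d arrow_bytes=%d' % (
        resource.getrusage(resource.RUSAGE_SELF).ru_maxrss,
        pa.total_allocated_bytes()))
{code}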

 

> Memory leak when read_table().to_pandas().to_json()
> ---------------------------------------------------
>
>                 Key: ARROW-5302
>                 URL: https://issues.apache.org/jira/browse/ARROW-5302
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>    Affects Versions: 0.13.0
>         Environment: Linux, Python 3.6.4 :: Anaconda, Inc.
>            Reporter: Jorge
>            Priority: Major
>              Labels: memory-leak
>



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
