[ https://issues.apache.org/jira/browse/ARROW-1017?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16009757#comment-16009757 ]
Wes McKinney commented on ARROW-1017:
-------------------------------------
Update. The following leaks memory:
{code:language=python}
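import pyarrow as pa
import numpy as np

def leak():
    # One column of 100 million float64 values (roughly 800 MB)
    data = [pa.array(np.concatenate([np.random.randn(1000000)] * 100))]
    table = pa.Table.from_arrays(data, ['foo'])
    while True:
        # Converting the whole table leaks on every iteration
        table.to_pandas()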
{code}
Converting only the first column does not:
{code:language=python}
def leak():
    data = [pa.array(np.concatenate([np.random.randn(1000000)] * 100))]
    table = pa.Table.from_arrays(data, ['foo'])
    while True:
        # Converting a single column does not leak
        table[0].to_pandas()
{code}
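To compare the two cases without looping forever, a bounded version of the loop can record a memory reading after each call. The following is only a sketch: `memory_growth` and `peak_rss_kb` are hypothetical helper names, not part of pyarrow, and the default probe uses the standard-library `resource` module, so it assumes a Unix-like system. `ru_maxrss` is a peak value (kilobytes on Linux, bytes on macOS), so it shows growth but never shrinks.

```python
import gc
import resource


def peak_rss_kb():
    """Peak resident set size of this process (KB on Linux, bytes on macOS)."""
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss


def memory_growth(func, iterations=10, probe=peak_rss_kb):
    """Call func repeatedly and return the probe reading taken after each call.

    Hypothetical helper: readings that keep climbing across iterations
    suggest that func retains memory between calls.
    """
    readings = []
    for _ in range(iterations):
        func()
        gc.collect()  # rule out garbage that is merely uncollected
        readings.append(probe())
    return readings
```

With the table from the snippets above, this could be called as `memory_growth(lambda: table.to_pandas())` and `memory_growth(lambda: table[0].to_pandas())` to compare the two readings side by side.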
> Python: Table.to_pandas leaks memory
> ------------------------------------
>
> Key: ARROW-1017
> URL: https://issues.apache.org/jira/browse/ARROW-1017
> Project: Apache Arrow
> Issue Type: Bug
> Components: Python
> Affects Versions: 0.3.0
> Reporter: James Porritt
> Assignee: Wes McKinney
> Fix For: 0.4.0
>
>
> Running the following code results in ever-increasing memory usage, even
> though I would expect the dataframe to be garbage collected when it goes out
> of scope. For the size of my parquet file, I see the usage increase by about
> 3 GB per loop:
> {code}
> from pyarrow import HdfsClient
> def read_parquet_file(client, parquet_file):
>     parquet = client.read_parquet(parquet_file)
>     df = parquet.to_pandas()
> client = HdfsClient("hdfshost", 8020, "myuser", driver='libhdfs3')
> parquet_file = '/my/parquet/file'
> while True:
>     read_parquet_file(client, parquet_file)
> {code}
> Is there a reference count issue similar to ARROW-362?
--
This message was sent by Atlassian JIRA
(v6.3.15#6346)