James Porritt created ARROW-1017:
------------------------------------

             Summary: Python: Calling to_pandas on a Parquet file in HDFS leaks memory
                 Key: ARROW-1017
                 URL: https://issues.apache.org/jira/browse/ARROW-1017
             Project: Apache Arrow
          Issue Type: Bug
          Components: Python
    Affects Versions: 0.3.0
            Reporter: James Porritt


Running the following code results in ever-increasing memory usage, even though 
I would expect the dataframe to be garbage collected when it goes out of scope. 
For the size of my Parquet file, I see usage increase by about 3 GB per loop:

{code}
from pyarrow import HdfsClient

def read_parquet_file(client, parquet_file):
    parquet = client.read_parquet(parquet_file)
    df = parquet.to_pandas()

client = HdfsClient("hdfshost", 8020, "myuser", driver='libhdfs3')
parquet_file = '/my/parquet/file'
while True:
    read_parquet_file(client, parquet_file)
{code}

Is there a reference count issue similar to ARROW-362?
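
To tell a genuine leak apart from objects that are merely awaiting collection, one option is to force a garbage-collection pass after every call and record the process's peak resident set size. The following is only a sketch of that idea: the helper name peak_rss_per_iteration and the stand-in workload are hypothetical and not part of pyarrow, and the resource module is only available on Unix-like systems.

```python
import gc
import resource


def peak_rss_per_iteration(fn, iterations=5):
    """Call fn() repeatedly, forcing a GC pass after each call.

    Returns the process's peak resident set size sampled after each call.
    """
    samples = []
    for _ in range(iterations):
        fn()
        gc.collect()  # rule out objects merely awaiting cyclic collection
        # ru_maxrss is the peak RSS of this process
        # (kilobytes on Linux, bytes on macOS).
        samples.append(resource.getrusage(resource.RUSAGE_SELF).ru_maxrss)
    return samples


# Hypothetical stand-in workload so the sketch runs without HDFS; for the
# report above it would be: lambda: read_parquet_file(client, parquet_file)
def workload():
    data = [0] * 100_000
    return len(data)


samples = peak_rss_per_iteration(workload, iterations=3)
print(samples)
```

If the samples keep climbing by roughly the size of the dataframe even after gc.collect(), the memory is probably still referenced somewhere (or held natively) rather than waiting on Python's collector. Process RSS is sampled here rather than tracemalloc because tracemalloc only sees allocations made through Python's own allocators.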



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
