[ https://issues.apache.org/jira/browse/ARROW-1017?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Wes McKinney updated ARROW-1017:
--------------------------------
    Fix Version/s: 0.4.0

> Python: Calling to_pandas on a Parquet file in HDFS leaks memory
> ----------------------------------------------------------------
>
>                 Key: ARROW-1017
>                 URL: https://issues.apache.org/jira/browse/ARROW-1017
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>    Affects Versions: 0.3.0
>            Reporter: James Porritt
>             Fix For: 0.4.0
>
>
> Running the following code results in ever-increasing memory usage, even
> though I would expect the dataframe to be garbage collected when it goes out
> of scope. For the size of my Parquet file, I see the usage increase by about
> 3GB per loop:
> {code}
> from pyarrow import HdfsClient
> def read_parquet_file(client, parquet_file):
>     parquet = client.read_parquet(parquet_file)
>     df = parquet.to_pandas()
> client = HdfsClient("hdfshost", 8020, "myuser", driver='libhdfs3')
> parquet_file = '/my/parquet/file'
> while True:
>     read_parquet_file(client, parquet_file)
> {code}
> Is there a reference count issue similar to ARROW-362?



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
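
The per-loop growth described in the report could be quantified with a small harness like the one below. This is only a sketch, not part of the original issue: the helper names `peak_rss_per_iteration` and `allocate_and_drop` are hypothetical, and the workload is a stand-in that allocates and frees a buffer so the example runs without HDFS. It relies on the standard-library `resource` module (Unix only), whose `ru_maxrss` field reports the peak resident set size, in kilobytes on Linux and bytes on macOS.

```python
import gc
import resource


def peak_rss_per_iteration(fn, iterations):
    """Call fn() repeatedly and record the process's peak RSS after each call.

    ru_maxrss is a high-water mark, so the readings never decrease; a leak
    shows up as a reading that keeps climbing on every iteration.
    """
    readings = []
    for _ in range(iterations):
        fn()
        # Force a collection so garbage that is merely uncollected is not
        # mistaken for a leak.
        gc.collect()
        readings.append(resource.getrusage(resource.RUSAGE_SELF).ru_maxrss)
    return readings


def allocate_and_drop():
    # Stand-in workload (an assumption for illustration): allocate 10 MiB
    # and release it immediately.
    data = bytearray(10 * 1024 * 1024)
    del data


readings = peak_rss_per_iteration(allocate_and_drop, 5)
# Growth between consecutive iterations, in the platform's ru_maxrss units.
deltas = [later - earlier for earlier, later in zip(readings, readings[1:])]
print(deltas)
```

To apply this to the reproduction above, the stand-in could be replaced with something like `lambda: read_parquet_file(client, parquet_file)`. Deltas that settle near zero after the first few iterations would suggest memory is being reused, while a steady increase would be consistent with the leak reported here.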