James Porritt created ARROW-1017:
------------------------------------

             Summary: Python: Calling to_pandas on a Parquet file in HDFS leaks memory
                 Key: ARROW-1017
                 URL: https://issues.apache.org/jira/browse/ARROW-1017
             Project: Apache Arrow
          Issue Type: Bug
          Components: Python
    Affects Versions: 0.3.0
            Reporter: James Porritt

Running the following code results in ever-increasing memory usage, even though I would expect the dataframe to be garbage collected when it goes out of scope. For the size of my Parquet file, I see usage increase by about 3 GB per loop iteration:

{code}
from pyarrow import HdfsClient

def read_parquet_file(client, parquet_file):
    # Read the file into an Arrow table, then convert it to a pandas DataFrame.
    # Nothing is returned, so both objects go out of scope when the function exits.
    parquet = client.read_parquet(parquet_file)
    df = parquet.to_pandas()

client = HdfsClient("hdfshost", 8020, "myuser", driver='libhdfs3')
parquet_file = '/my/parquet/file'

while True:
    read_parquet_file(client, parquet_file)
{code}

Is there a reference count issue similar to ARROW-362?
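
To put a number on the growth per iteration, something like the following could be wrapped around the call. This is only a minimal sketch: `current_rss_bytes` and `measure_growth` are hypothetical helper names, not part of pyarrow, and the default reading comes from `/proc/self/statm`, so it assumes Linux. It forces a garbage collection after each call so that memory merely awaiting collection is not counted as a leak.

```python
import gc
import os


def current_rss_bytes():
    """Return this process's resident set size in bytes (Linux only)."""
    # The second field of /proc/self/statm is the resident size in pages.
    with open("/proc/self/statm") as f:
        resident_pages = int(f.read().split()[1])
    return resident_pages * os.sysconf("SC_PAGE_SIZE")


def measure_growth(fn, iterations=5, get_usage=current_rss_bytes):
    """Call fn repeatedly and return the memory change seen across each call."""
    deltas = []
    for _ in range(iterations):
        before = get_usage()
        fn()
        # Collect explicitly so uncollected Python garbage is not mistaken
        # for memory that is still referenced somewhere.
        gc.collect()
        deltas.append(get_usage() - before)
    return deltas
```

With the reproduction above, it could be called as `measure_growth(lambda: read_parquet_file(client, parquet_file))`. If the deltas stay large and positive on every iteration even after `gc.collect()`, that would be consistent with memory being held outside Python's garbage collector, though resident size alone cannot show where it is held.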