Sergey Mozharov created ARROW-6874:
--------------------------------------

             Summary: Memory leak in Table.to_pandas() when nested columns are present
                 Key: ARROW-6874
                 URL: https://issues.apache.org/jira/browse/ARROW-6874
             Project: Apache Arrow
          Issue Type: Bug
          Components: Python
    Affects Versions: 0.15.0
         Environment: Operating system: Windows 10
                      pyarrow installed via conda
                      both python environments were identical except pyarrow:
                      python: 3.6.7
                      numpy: 1.17.2
                      pandas: 0.25.1
            Reporter: Sergey Mozharov

I upgraded from pyarrow 0.14.1 to 0.15.0 and during some testing my python interpreter ran out of memory.

I narrowed the issue down to the pyarrow.Table.to_pandas() call, which appears to have a memory leak in the latest version. See details below to reproduce this issue.

{code:java}
import numpy as np
import pandas as pd
import pyarrow as pa

# create a table with one nested array column
nested_array = pa.array([np.random.rand(1000) for i in range(500)])
nested_array.type  # ListType(list<item: double>)
table = pa.Table.from_arrays(arrays=[nested_array], names=['my_arrays'])

# convert it to a pandas DataFrame in a loop to monitor memory consumption
num_iterations = 10000
# pyarrow v0.14.1: Memory allocation does not grow during loop execution
# pyarrow v0.15.0: ~550 Mb is added to RAM, never garbage collected
for i in range(num_iterations):
    df = pa.Table.to_pandas(table)

# When the table column is not nested, no memory leak is observed
array = pa.array(np.random.rand(500 * 1000))
table = pa.Table.from_arrays(arrays=[array], names=['numbers'])
# no memory leak:
for i in range(num_iterations):
    df = pa.Table.to_pandas(table)
{code}

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
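
The reproduction above relies on watching memory consumption by hand. A small helper along the following lines could put a number on the growth instead. This is only a sketch and not part of the original report: `measure_growth` and its `sampler` parameter are hypothetical names, and the default sampler uses the standard-library `tracemalloc`, which only sees allocations made through Python's allocators.

```python
import gc
import tracemalloc


def measure_growth(func, iterations=100, sampler=None):
    """Call func() repeatedly and return the change in sampled bytes.

    sampler is a zero-argument callable returning a byte count
    (hypothetical parameter). When omitted, tracemalloc's currently
    traced size is used.
    """
    started_tracing = False
    if sampler is None:
        if not tracemalloc.is_tracing():
            tracemalloc.start()
            started_tracing = True

        def sampler():
            # get_traced_memory() returns (current, peak); use current
            return tracemalloc.get_traced_memory()[0]

    func()        # warm-up call so one-time setup is not counted as growth
    gc.collect()  # drop collectable garbage before the first sample
    before = sampler()

    for _ in range(iterations):
        func()

    gc.collect()  # only memory that survives collection counts
    after = sampler()

    if started_tracing:
        tracemalloc.stop()
    return after - before
```

With the table from the reproduction, a call might look like `measure_growth(lambda: table.to_pandas(), iterations=1000)`. Passing `sampler=pa.total_allocated_bytes` would instead track Arrow's default memory pool. Which allocator the leaked memory comes from is not established in this report, so a sampler that reads the process's resident set size may be the more reliable choice.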