[ https://issues.apache.org/jira/browse/ARROW-6874?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Joris Van den Bossche updated ARROW-6874:
-----------------------------------------
    Fix Version/s: 0.15.1

> Memory leak in Table.to_pandas() when nested columns are present
> ----------------------------------------------------------------
>
>                 Key: ARROW-6874
>                 URL: https://issues.apache.org/jira/browse/ARROW-6874
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>    Affects Versions: 0.15.0
>         Environment: Operating system: Windows 10
> pyarrow installed via conda
> both python environments were identical except pyarrow:
> python: 3.6.7
> numpy: 1.17.2
> pandas: 0.25.1
>            Reporter: Sergey Mozharov
>            Priority: Major
>             Fix For: 0.15.1
>
>
> I upgraded from pyarrow 0.14.1 to 0.15.0 and during some testing my python
> interpreter ran out of memory.
> I narrowed the issue down to the pyarrow.Table.to_pandas() call, which
> appears to have a memory leak in the latest version. See details below to
> reproduce this issue.
>
> {code:java}
> import numpy as np
> import pandas as pd
> import pyarrow as pa
> # create a table with one nested array column
> nested_array = pa.array([np.random.rand(1000) for i in range(500)])
> nested_array.type  # ListType(list<item: double>)
> table = pa.Table.from_arrays(arrays=[nested_array], names=['my_arrays'])
> # convert it to a pandas DataFrame in a loop to monitor memory consumption
> num_iterations = 10000
> # pyarrow v0.14.1: Memory allocation does not grow during loop execution
> # pyarrow v0.15.0: ~550 Mb is added to RAM, never garbage collected
> for i in range(num_iterations):
>     df = pa.Table.to_pandas(table)
> # When the table column is not nested, no memory leak is observed
> array = pa.array(np.random.rand(500 * 1000))
> table = pa.Table.from_arrays(arrays=[array], names=['numbers'])
> # no memory leak:
> for i in range(num_iterations):
>     df = pa.Table.to_pandas(table){code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
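
The report above monitors memory by watching the process while the loop runs. As a rough, standard-library-only sketch of how that growth could be turned into a number, the helper below calls a function repeatedly and reports the net change in bytes traced by tracemalloc. The name traced_growth and its parameters are hypothetical and are not part of pyarrow or of the original report.

```python
import gc
import tracemalloc


def traced_growth(fn, iterations=1000, warmup=10):
    """Return the net change in traced bytes after calling fn() repeatedly.

    Hypothetical helper for illustration only. It counts allocations made
    through Python's allocators, so memory owned by native libraries that
    bypass them will not show up here.
    """
    # Warm up first so one-time caches are not counted as growth.
    for _ in range(warmup):
        fn()
    gc.collect()

    tracemalloc.start()
    try:
        before, _ = tracemalloc.get_traced_memory()
        for _ in range(iterations):
            fn()
        # Collect garbage so only memory that is still reachable
        # (or leaked) remains in the final reading.
        gc.collect()
        after, _ = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return after - before
```

With the table from the report, a call such as traced_growth(lambda: table.to_pandas(), iterations=10000) would be expected to stay near zero on an unaffected version and to grow with the iteration count on an affected one, provided the leaked memory is visible to tracemalloc. Allocations made in Arrow's own memory pool are not visible this way; comparing pyarrow.total_allocated_bytes() before and after the loop is one way to check those.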