[ https://issues.apache.org/jira/browse/ARROW-6874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17171226#comment-17171226 ]
jesse ventura commented on ARROW-6874:
--------------------------------------
Why have so many related issues been closed? This problem still exists in
0.17 and 1.0.0!
Here is my code; the leak shows up especially with multiprocessing:
{code:python}
#!/usr/bin/env python
# coding: utf-8
from memory_profiler import profile
from concurrent.futures import ProcessPoolExecutor, as_completed
import os
import pandas as pd
from progressbar import ProgressBar

pd.set_option('expand_frame_repr', False)
pd.set_option('display.max_rows', 10)


def get_all_df_dict():
    # engine='fastparquet' does NOT have this memory leak problem
    return pd.read_parquet('xx.parquet')


alldf_dict = get_all_df_dict()


@profile
def parse_single_tday(tday_str):
    # engine='fastparquet'
    df = pd.read_parquet('{}.parquet'.format(tday_str))
    print(df.head())


@profile
def main():
    # daily files named yyyymmdd.parquet
    datdir = '/everydat_data/'
    tday_strs = [item.split('.parquet')[0] for item in os.listdir(datdir)]
    wkn = 3
    with ProcessPoolExecutor(max_workers=wkn) as exe:
        # pass the full path so workers read from datdir, not the cwd
        plist = [exe.submit(parse_single_tday, os.path.join(datdir, tday))
                 for tday in tday_strs]
        res = [task.result()
               for task in ProgressBar(max_value=len(plist))(as_completed(plist))]


if __name__ == '__main__':
    main()
{code}
After the worker subprocesses finish, the memory used by the Parquet DataFrames
loaded through pyarrow is never released:
!Screenshot_2020-08-05_10-11-45.png!
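For what it's worth, one way to check whether the retained memory is still tracked by Arrow's allocator, or has merely not been returned to the OS by jemalloc, is sketched below. This is a diagnostic only, not a fix, and 'xx.parquet' is a placeholder file name:
{code:python}
import pandas as pd
import pyarrow as pa

# Ask jemalloc to return freed memory to the OS immediately instead of
# caching it (available since pyarrow 0.14). If RSS stops growing with
# this set, the growth was allocator caching rather than a live leak.
pa.jemalloc_set_decay_ms(0)

df = pd.read_parquet('xx.parquet')  # placeholder file name
del df

# If this prints ~0, Arrow has freed its buffers, and any remaining RSS
# growth comes from the allocator, not from unreleased Arrow memory.
print(pa.total_allocated_bytes())
{code}
In the multiprocessing case, recycling workers also bounds the growth, since a worker's memory is returned to the OS when the process exits, e.g. multiprocessing.Pool(processes=3, maxtasksperchild=1) instead of ProcessPoolExecutor.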
> [Python] Memory leak in Table.to_pandas() when conversion to object dtype
> -------------------------------------------------------------------------
>
> Key: ARROW-6874
> URL: https://issues.apache.org/jira/browse/ARROW-6874
> Project: Apache Arrow
> Issue Type: Bug
> Components: Python
> Affects Versions: 0.15.0
> Environment: Operating system: Windows 10
> pyarrow installed via conda
> both python environments were identical except pyarrow:
> python: 3.6.7
> numpy: 1.17.2
> pandas: 0.25.1
> Reporter: Sergey Mozharov
> Assignee: Antoine Pitrou
> Priority: Major
> Labels: pull-request-available
> Fix For: 0.15.1, 0.16.0
>
> Attachments: Screenshot_2020-08-05_10-11-45.png
>
> Time Spent: 1.5h
> Remaining Estimate: 0h
>
> I upgraded from pyarrow 0.14.1 to 0.15.0 and during some testing my python
> interpreter ran out of memory.
> I narrowed the issue down to the pyarrow.Table.to_pandas() call, which
> appears to have a memory leak in the latest version. See details below to
> reproduce this issue.
>
> {code:python}
> import numpy as np
> import pandas as pd
> import pyarrow as pa
>
> # create a table with one nested array column
> nested_array = pa.array([np.random.rand(1000) for i in range(500)])
> nested_array.type  # ListType(list<item: double>)
> table = pa.Table.from_arrays(arrays=[nested_array], names=['my_arrays'])
>
> # convert it to a pandas DataFrame in a loop to monitor memory consumption
> num_iterations = 10000
> # pyarrow v0.14.1: memory allocation does not grow during loop execution
> # pyarrow v0.15.0: ~550 MB is added to RAM, never garbage collected
> for i in range(num_iterations):
>     df = pa.Table.to_pandas(table)
>
> # when the table column is not nested, no memory leak is observed
> array = pa.array(np.random.rand(500 * 1000))
> table = pa.Table.from_arrays(arrays=[array], names=['numbers'])
> # no memory leak:
> for i in range(num_iterations):
>     df = pa.Table.to_pandas(table)
> {code}