[ 
https://issues.apache.org/jira/browse/ARROW-6976?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17171221#comment-17171221
 ] 

jesse ventura commented on ARROW-6976:
--------------------------------------

Why have so many related issues been closed? This problem still exists in 
0.17 and 1.0.0!

Here is my code. The problem shows up especially with multiprocessing:
{code:java}
#!/usr/bin/env python
# coding: utf-8
import os
from concurrent.futures import ProcessPoolExecutor, as_completed

import pandas as pd
from memory_profiler import profile
from progressbar import ProgressBar

pd.set_option('expand_frame_repr', False)
pd.set_option('display.max_rows', 10)


def get_all_df_dict():
    return pd.read_parquet('xx.parquet',
                           # engine='fastparquet'  # fastparquet does NOT have the memory leak problem
                           )


# loaded once in the parent process, before the workers start
alldf_dict = get_all_df_dict()


@profile
def parse_single_tday(tday_str):
    df = pd.read_parquet('{}.parquet'.format(tday_str),
                         # engine='fastparquet'
                         )
    print(df.head())


@profile
def main():
    # files are named yyyymmdd.parquet
    datdir = '/everydat_data/'
    tday_strs = [item.split('.parquet')[0] for item in os.listdir(datdir)]
    wkn = 3
    with ProcessPoolExecutor(max_workers=wkn) as exe:
        plist = [exe.submit(parse_single_tday, tday) for tday in tday_strs]
        res = [task.result() for task in
               ProgressBar(max_value=len(plist))(as_completed(plist))]


if __name__ == '__main__':
    main()
{code}
 

When a subprocess finishes, the memory used by the DataFrame that pyarrow 
loaded from Parquet is not released.
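
One way to cross-check this independently of memory_profiler is to log each worker's resident set size around the load. The following is only a minimal sketch, assuming Linux (it reads /proc/self/statm). `load_stand_in` is a hypothetical placeholder for the `pd.read_parquet` call, and none of the helper names come from the script above.

```python
import os
from concurrent.futures import ProcessPoolExecutor


def current_rss_mib():
    """Current resident set size of this process in MiB (Linux only)."""
    with open("/proc/self/statm") as f:
        resident_pages = int(f.read().split()[1])
    return resident_pages * os.sysconf("SC_PAGE_SIZE") / (1024 * 1024)


def load_stand_in(n_bytes):
    """Hypothetical stand-in for pd.read_parquet: allocate and fill a buffer."""
    return b"x" * n_bytes


def measure_task(n_bytes):
    """Return (pid, RSS before load, RSS while holding data, RSS after del)."""
    before = current_rss_mib()
    data = load_stand_in(n_bytes)
    held = current_rss_mib()
    del data
    after = current_rss_mib()
    return os.getpid(), before, held, after


def run(sizes, workers=2):
    """Run measure_task for each size in a pool of worker processes."""
    with ProcessPoolExecutor(max_workers=workers) as exe:
        return list(exe.map(measure_task, sizes))


if __name__ == "__main__":
    for pid, before, held, after in run([50 * 1024 * 1024] * 4):
        print(pid, round(before, 1), round(held, 1), round(after, 1))
```

If the "after" value for the same worker pid keeps climbing from task to task once the real read is swapped in, that would be consistent with memory not being returned. It does not by itself show whether the allocator or the library is holding it.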

 

!Screenshot_2020-08-05_10-11-45.png!

> Possible memory leak in pyarrow read_parquet
> --------------------------------------------
>
>                 Key: ARROW-6976
>                 URL: https://issues.apache.org/jira/browse/ARROW-6976
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>    Affects Versions: 0.15.0
>         Environment: linux ubuntu 18.04
>            Reporter: david cottrell
>            Priority: Critical
>         Attachments: image-2019-10-23-16-17-20-739.png, pyarrow-master.png, 
> pyarrow_0150.png
>
>
>  
> Version and repro info in the gist below.
> Not sure if I'm misunderstanding something from this 
> [https://arrow.apache.org/blog/2019/02/05/python-string-memory-0.12/]
> but there seems to be memory accumulation that is exacerbated by 
> higher arity objects like strings and dates (not datetimes).
>  
> I was not able to reproduce the issue on MacOS. Downgrading to 0.14.1 seemed 
> to "fix" or lessen the problem.
>  
> [https://gist.github.com/cottrell/a3f95aa59408d87f925ec606d8783e62]
>  
> Let me know if this post should go elsewhere.
> !image-2019-10-23-16-17-20-739.png!
>  
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
