[ https://issues.apache.org/jira/browse/ARROW-16037?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17513589#comment-17513589 ]

Will Jones commented on ARROW-16037:
------------------------------------

Okay, interesting. So you don't see memory monotonically increasing; it levels 
out at about 800 MB. I am unable to reproduce this, at least with pyarrow 6.0.1 
on macOS. What versions of numpy and pandas are you using? It may also be worth 
setting the environment variable outside of Python; I forgot that on some 
platforms setting it through os.environ does not take effect.

 
{code:bash}
# Before launching python
export ARROW_DEFAULT_MEMORY_POOL=system
{code}
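 
Alternatively, here is a minimal sketch of doing the same thing in-process. This must run before any Arrow allocations you care about, since existing buffers stay with the pool they were allocated from:
{code:python}
import pyarrow as pa

# Switch the default pool to the system (malloc-based) allocator.
# Only allocations made after this call use the new pool.
pa.set_memory_pool(pa.system_memory_pool())

print(pa.default_memory_pool().backend_name)  # expected: "system"
{code}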
 
{code:python}
import pyarrow as pa
import numpy as np
import pandas as pd
import os, psutil
import pyarrow.compute as compute
import gc

# ~80 MB table of random doubles; pandas integer column names become strings.
my_table = pa.Table.from_pandas(pd.DataFrame(np.random.normal(size=(10000, 1000))))

print("using backend: {}".format(pa.default_memory_pool().backend_name))

process = psutil.Process(os.getpid())
print("mem usage {:,} {:,}".format(process.memory_info().rss,
                                   pa.total_allocated_bytes()))

# Repeatedly sort and take; RSS should stay flat if nothing leaks.
for i in range(100):
    print("mem usage {:,} {:,}".format(process.memory_info().rss,
                                       pa.total_allocated_bytes()))
    temp = compute.sort_indices(my_table['0'], sort_keys=[('0', 'ascending')])
    my_table = my_table.take(temp)
    gc.collect()
{code}
Here is the output I get (truncated, but it is very consistent throughout). The stray "0" lines are the return value of gc.collect() echoed by the interactive session.
{code}
using backend: system
mem usage 256,311,296 80,080,000
mem usage 256,098,304 80,080,000
0
mem usage 256,819,200 81,360,000
0
mem usage 256,851,968 81,360,000
0
mem usage 256,917,504 81,360,000
0
mem usage 256,884,736 81,360,000
0
mem usage 256,950,272 81,360,000
0
mem usage 257,081,344 81,360,000
0
mem usage 256,704,512 81,360,000
0
mem usage 257,081,344 81,360,000
0
mem usage 257,212,416 81,360,000
{code}
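
For what it's worth, if the extra RSS turns out to be the allocator caching freed pages rather than a true leak, a rough way to probe that is sketched below. Note that pa.jemalloc_set_decay_ms() only affects the jemalloc backend, and MemoryPool.release_unused() was added in a later pyarrow release, so it may not be available in 6.0.1:
{code:python}
import pyarrow as pa

# Ask jemalloc to return dirty pages to the OS immediately instead of
# caching them (no effect unless the jemalloc backend is in use).
pa.jemalloc_set_decay_ms(0)

# Ask the active pool to hand unused memory back to the OS (newer
# pyarrow only). If RSS drops afterwards, it was caching, not a leak.
pa.default_memory_pool().release_unused()
{code}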

> Possible memory leak in compute.take
> ------------------------------------
>
>                 Key: ARROW-16037
>                 URL: https://issues.apache.org/jira/browse/ARROW-16037
>             Project: Apache Arrow
>          Issue Type: Bug
>    Affects Versions: 6.0.1
>         Environment: Ubuntu
>            Reporter: Ziheng Wang
>            Priority: Blocker
>
> If you run the following code, the memory usage of the process climbs to about
> 1 GB even though the pyarrow allocated bytes stay at ~80 MB. The process
> memory comes down after a while to 800 MB, which is still far more than
> necessary.
> '''
> import pyarrow as pa
> import numpy as np
> import pandas as pd
> import os, psutil
> import pyarrow.compute as compute
> import gc
> my_table = pa.Table.from_pandas(pd.DataFrame(np.random.normal(size=(10000,1000))))
> process = psutil.Process(os.getpid())
> print("mem usage", process.memory_info().rss, pa.total_allocated_bytes())
> for i in range(100):
>     print("mem usage", process.memory_info().rss, pa.total_allocated_bytes())
>     temp = compute.sort_indices(my_table['0'], sort_keys=[('0','ascending')])
>     my_table = my_table.take(temp)
>     gc.collect()
> '''


