[
https://issues.apache.org/jira/browse/ARROW-16037?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17513470#comment-17513470
]
Will Jones commented on ARROW-16037:
------------------------------------
Most likely, what you are seeing is memory being held onto by the memory pool
for future re-use. Arrow uses jemalloc by default on Linux. You can check this
by running {{os.environ['ARROW_DEFAULT_MEMORY_POOL'] = 'system'}} before
pyarrow is first imported (the variable is read at import time), which should
result in more consistent memory usage.
Could you try that? And if you still think you see a memory leak, could you
share the output you are seeing from the above snippet?
> Possible memory leak in compute.take
> ------------------------------------
>
> Key: ARROW-16037
> URL: https://issues.apache.org/jira/browse/ARROW-16037
> Project: Apache Arrow
> Issue Type: Bug
> Affects Versions: 6.0.1
> Environment: Ubuntu
> Reporter: Ziheng Wang
> Priority: Blocker
>
> If you run the following code, the memory usage of the process goes up to 1GB
> even though the pyarrow allocated bytes is always at ~80MB. The process
> memory comes down after a while to 800 MB, but is still way more than what is
> necessary.
> '''
> import pyarrow as pa
> import numpy as np
> import pandas as pd
> import os, psutil
> import pyarrow.compute as compute
> import gc
> my_table = pa.Table.from_pandas(pd.DataFrame(np.random.normal(size=(10000,1000))))
> process = psutil.Process(os.getpid())
> print("mem usage", process.memory_info().rss, pa.total_allocated_bytes())
> for i in range(100):
>     print("mem usage", process.memory_info().rss, pa.total_allocated_bytes())
>     temp = compute.sort_indices(my_table['0'], sort_keys=[('0','ascending')])
>     my_table = my_table.take(temp)
>     gc.collect()
> '''
--
This message was sent by Atlassian Jira
(v8.20.1#820001)