[ 
https://issues.apache.org/jira/browse/ARROW-14790?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17447365#comment-17447365
 ] 

Antoine Pitrou commented on ARROW-14790:
----------------------------------------

Two things:

1) {{sys.getrefcount(object)}} in Python will not tell you anything useful 
here: it counts references to the one Python object passed in, not the 
memory held by Arrow

2) this does not look like a memory leak since memory consumption seems to 
reach a fixed point. Modern memory allocators are complex and they don't 
necessarily return memory to the system when the memory is freed, because 
it can be costly. Instead, they use heuristics and keep some freed memory as a 
cache for future allocations.

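As a rough illustration of the "fixed point" argument, the sketch below computes the growth between consecutive readings. The figures are copied from the Python run in the issue description; the helper name growth_per_iteration is made up for this example. It assumes a steady leak would add a similar amount on every iteration, which is a simplification.

```python
# Resident memory (MB) printed by arrow_memory_leak.py, copied from the
# issue description. The first value is the reading before any CSV load.
readings_mb = [
    31.75390625, 382.28515625, 549.41796875, 656.78125, 679.6875,
    691.9921875, 708.73828125, 717.296875, 724.390625, 729.19921875,
    734.47265625,
]


def growth_per_iteration(readings):
    """Return the increase between each pair of consecutive readings."""
    return [later - earlier for earlier, later in zip(readings, readings[1:])]


deltas = growth_per_iteration(readings_mb)
for i, delta in enumerate(deltas, start=1):
    print(f"iteration {i}: +{delta:.2f} MB")
```

The first load adds about 350 MB while each of the last few adds well under 20 MB, which is consistent with an allocator cache filling up rather than with a leak of constant size per load.
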
PyArrow exposes an API to try and return released memory to the system. It is 
best-effort since it relies on how the underlying allocator (e.g. jemalloc) 
works:
{code:python}
>>> import pyarrow as pa
>>> pool = pa.default_memory_pool()
>>> pool.release_unused()
{code}

You may try that in your script. I don't know if we expose the same API in 
Ruby. [~kou]

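To tell allocator caching apart from memory Arrow still holds, it may help to look at the pool's own counters around the release_unused() call. The following is only a sketch: release_and_snapshot is a hypothetical helper, and it assumes the object passed in behaves like pyarrow.MemoryPool (backend_name, bytes_allocated(), max_memory(), release_unused()).

```python
def release_and_snapshot(pool):
    """Ask the pool to release unused memory, then report its counters.

    `pool` is assumed to behave like pyarrow.MemoryPool. The helper itself
    is hypothetical and not part of PyArrow.
    """
    # Best-effort: the underlying allocator may still keep some memory cached.
    pool.release_unused()
    return {
        "backend": pool.backend_name,               # e.g. "jemalloc" or "system"
        "bytes_allocated": pool.bytes_allocated(),  # bytes Arrow currently holds
        "max_memory": pool.max_memory(),            # peak allocation for this pool
    }
```

With PyArrow installed, this could be called as release_and_snapshot(pa.default_memory_pool()) after each iteration, once the table has been deleted. If bytes_allocated is back near zero while the process memory stays high, the remaining memory is most likely held by the allocator as a cache rather than leaked by Arrow.
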
> Memory leak when reading CSV files
> ----------------------------------
>
>                 Key: ARROW-14790
>                 URL: https://issues.apache.org/jira/browse/ARROW-14790
>             Project: Apache Arrow
>          Issue Type: Bug
>            Reporter: Sten Larsson
>            Priority: Major
>
> We're having a problem with a memory leak in a Ruby script that processes many 
> CSV files. I have written some short scripts to demonstrate the problem: 
> [https://gist.github.com/stenlarsson/60b1e4e99416738b41ee30e7ba294214]
> The first script, 
> [arrow_test_csv.rb|https://gist.github.com/stenlarsson/60b1e4e99416738b41ee30e7ba294214#file-arrow_test_csv-rb],
>  creates a 184 MB CSV file for testing.
> The second script, 
> [arrow_memory_leak.rb|https://gist.github.com/stenlarsson/60b1e4e99416738b41ee30e7ba294214#file-arrow_memory_leak-rb],
>  then loads the CSV file 10 times using Arrow. It uses the 
> [get_process_mem|https://rubygems.org/gems/get_process_mem] gem to print the 
> memory usage both before and after each iteration. It also invokes the 
> garbage collector on each iteration to ensure the problem is not that Ruby 
> holds on to any objects. This is what it prints on my MacBook Pro using Arrow 
> 6.0.0:
> {noformat}
> 127577 objects, 34.234375 MB
> 127577 objects, 347.625 MB
> 127577 objects, 438.7890625 MB
> 127577 objects, 457.6953125 MB
> 127577 objects, 469.8046875 MB
> 127577 objects, 480.88671875 MB
> 127577 objects, 487.96484375 MB
> 127577 objects, 493.8359375 MB
> 127577 objects, 497.671875 MB
> 127577 objects, 498.55859375 MB
> 127577 objects, 501.42578125 MB
> {noformat}
> The third script, [arrow_memory_leak.py 
> |https://gist.github.com/stenlarsson/60b1e4e99416738b41ee30e7ba294214#file-arrow_memory_leak-py]
>  is a Python implementation of the same script. This shows that the problem 
> is not in the Ruby bindings:
> {noformat}
> 2106 objects, 31.75390625 MB
> 2106 objects, 382.28515625 MB
> 2106 objects, 549.41796875 MB
> 2106 objects, 656.78125 MB
> 2106 objects, 679.6875 MB
> 2106 objects, 691.9921875 MB
> 2106 objects, 708.73828125 MB
> 2106 objects, 717.296875 MB
> 2106 objects, 724.390625 MB
> 2106 objects, 729.19921875 MB
> 2106 objects, 734.47265625 MB
> {noformat}
> I have also tested Arrow 5.0.0 and it has the same problem.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
