[
https://issues.apache.org/jira/browse/ARROW-14790?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17447365#comment-17447365
]
Antoine Pitrou commented on ARROW-14790:
----------------------------------------
Two things:
1) {{sys.getrefcount(object)}} in Python will not tell you anything actually
useful
2) this does not look like a memory leak since memory consumption seems to
reach a fixed point. Modern memory allocators are complex and they don't
necessarily return memory to the system when the memory is freed, because
it can be costly. Instead, they use heuristics and keep some freed memory as a
cache for future allocations.
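To make the "fixed point" argument concrete, here is a rough sketch that computes the per-load growth from the figures the Python script printed in the issue description. The comparison against the first load's increase is only a heuristic chosen for illustration, not an Arrow diagnostic: if every load leaked its data, the growth per load would be expected to stay roughly constant instead of shrinking.

```python
# Resident memory (MB) printed by the Python script in the report:
# one reading before the first load, then one after each of the 10 loads.
rss_mb = [
    31.75390625, 382.28515625, 549.41796875, 656.78125, 679.6875,
    691.9921875, 708.73828125, 717.296875, 724.390625, 729.19921875,
    734.47265625,
]

# Growth contributed by each individual load.
deltas = [after - before for before, after in zip(rss_mb, rss_mb[1:])]

# Heuristic (an assumption for illustration): compare the average growth of
# the last five loads with the growth caused by the very first load.
first_load = deltas[0]
late = deltas[-5:]
avg_late = sum(late) / len(late)

for i, delta in enumerate(deltas, start=1):
    print(f"load {i:2d}: +{delta:7.2f} MB")
print(f"first load: +{first_load:.2f} MB, "
      f"average of last 5 loads: +{avg_late:.2f} MB")
```

The later loads add only a small fraction of what the first one did, which is what one would expect from an allocator filling its cache rather than from data being leaked on every iteration.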
PyArrow exposes an API to try and return released memory to the system. It is
best-effort since it relies on how the underlying allocator (e.g. jemalloc)
works:
{code:python}
>>> import pyarrow as pa
>>> pool = pa.default_memory_pool()
>>> pool.release_unused()
{code}
You may try that in your script. I don't know if we expose the same API in
Ruby. [~kou]
> Memory leak when reading CSV files
> ----------------------------------
>
> Key: ARROW-14790
> URL: https://issues.apache.org/jira/browse/ARROW-14790
> Project: Apache Arrow
> Issue Type: Bug
> Reporter: Sten Larsson
> Priority: Major
>
> We're having a problem with a memory leak in a Ruby script that processes many
> CSV files. I have written some short scripts to demonstrate the problem:
> [https://gist.github.com/stenlarsson/60b1e4e99416738b41ee30e7ba294214]
> The first script,
> [arrow_test_csv.rb|https://gist.github.com/stenlarsson/60b1e4e99416738b41ee30e7ba294214#file-arrow_test_csv-rb],
> creates a 184 MB CSV file for testing.
> The second script,
> [arrow_memory_leak.rb|https://gist.github.com/stenlarsson/60b1e4e99416738b41ee30e7ba294214#file-arrow_memory_leak-rb],
> then loads the CSV file 10 times using Arrow. It uses the
> [get_process_mem|https://rubygems.org/gems/get_process_mem] gem to print the
> memory usage both before and after each iteration. It also invokes the
> garbage collector on each iteration to ensure the problem is not that Ruby
> holds on to any objects. This is what it prints on my MacBook Pro using Arrow
> 6.0.0:
> {noformat}
> 127577 objects, 34.234375 MB
> 127577 objects, 347.625 MB
> 127577 objects, 438.7890625 MB
> 127577 objects, 457.6953125 MB
> 127577 objects, 469.8046875 MB
> 127577 objects, 480.88671875 MB
> 127577 objects, 487.96484375 MB
> 127577 objects, 493.8359375 MB
> 127577 objects, 497.671875 MB
> 127577 objects, 498.55859375 MB
> 127577 objects, 501.42578125 MB
> {noformat}
> The third script,
> [arrow_memory_leak.py|https://gist.github.com/stenlarsson/60b1e4e99416738b41ee30e7ba294214#file-arrow_memory_leak-py],
> is a Python implementation of the same script. This shows that the problem
> is not in the Ruby bindings:
> {noformat}
> 2106 objects, 31.75390625 MB
> 2106 objects, 382.28515625 MB
> 2106 objects, 549.41796875 MB
> 2106 objects, 656.78125 MB
> 2106 objects, 679.6875 MB
> 2106 objects, 691.9921875 MB
> 2106 objects, 708.73828125 MB
> 2106 objects, 717.296875 MB
> 2106 objects, 724.390625 MB
> 2106 objects, 729.19921875 MB
> 2106 objects, 734.47265625 MB
> {noformat}
> I have also tested Arrow 5.0.0 and it has the same problem.
--
This message was sent by Atlassian Jira
(v8.20.1#820001)